🧠 NeurIPS2025 Paper Notes

2534 NeurIPS2025 paper notes covering Image Generation (246), Reinforcement Learning (169), Multimodal VLM (151), Medical Imaging (138), Model Compression (134), 3D Vision (116), Optimization & Theory (113), Interpretability (82), and 45 other areas. Each note has a TL;DR, motivation, method, experiments, highlights, and limitations, giving a 5-minute read of the core ideas.

🎨 Image Generation

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11): DFloat11 exploits the low-entropy property of exponent bits in BFloat16 weights to losslessly compress LLMs and diffusion models to approximately 70% of their original size (equivalent to ~11 bits) via Huffman coding. It further introduces hierarchical lookup tables and a two-phase GPU kernel for efficient online decompression, enabling lossless inference of Llama 3.1 405B on a single node with 8×80GB GPUs.
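As a rough illustration of why the exponent bits in the DFloat11 entry compress well, the sketch below estimates the Shannon entropy of the 8-bit exponent field for synthetic weights. It is not the DFloat11 codec: the function name `bf16_exponent_entropy`, the Gaussian weight distribution, and the truncation-based BFloat16 conversion are assumptions made for illustration, and neither the Huffman coder nor the GPU kernel is implemented.

```python
import numpy as np

def bf16_exponent_entropy(weights):
    """Shannon entropy (in bits) of the 8-bit exponent field.

    BFloat16 shares its sign and exponent layout with float32, so under
    truncation (assumed here) the exponent can be read from the float32 bits.
    """
    w = np.ascontiguousarray(weights, dtype=np.float32).ravel()
    bits = w.view(np.uint32)
    exponents = ((bits >> 23) & 0xFF).astype(np.int64)
    counts = np.bincount(exponents, minlength=256).astype(np.float64)
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

# Hypothetical stand-in for trained weights: small zero-mean Gaussian values.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=100_000)

h_exp = bf16_exponent_entropy(weights)
# The sign bit and 7 mantissa bits are stored as-is; only the exponent is entropy-coded.
ideal_bits = 1 + 7 + h_exp
print(f"exponent entropy: {h_exp:.2f} bits, ideal size: {ideal_bits:.2f} bits per weight")
```

An exponent entropy well below 8 bits is what makes a roughly 11-bit average plausible; real model weights need not follow this synthetic distribution.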
A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective: This paper identifies a generalization-to-memorization transition in diffusion models under self-consuming loops (where each generation of models is trained on synthetic data from the previous one), reveals a strong linear correlation between training set entropy and model generalization (Pearson $r=0.91$), and proposes entropy-based data selection strategies (Greedy Selection / Threshold Decay Filter) that effectively slow this transition—reducing FID from 75.7 to 44.7 at iteration 8 under the CIFAR-10 accumulate paradigm.
A Connection Between Score Matching and Local Intrinsic Dimension: This paper proves that the lower bound of the denoising score matching (DSM) loss is precisely the local intrinsic dimension (LID) of the data manifold, thereby establishing the DSM loss itself as an efficient LID estimator—requiring neither gradient computation nor multiple forward passes. On Stable Diffusion 3.5, this approach reduces peak memory usage to approximately 60% of FLIPD while yielding more stable estimates under quantization.
A Data-Driven Prism: Multi-View Source Separation with Diffusion Model Priors: This paper proposes DDPRISM, a method that exploits structural differences among linear transformations across multi-view observations. Within an EM framework, it learns an independent diffusion model prior for each unknown source without requiring any isolated source samples, enabling source separation and posterior sampling. DDPRISM outperforms existing methods on both synthetic benchmarks and real galaxy observations.
A Diffusion Model for Regular Time Series Generation from Irregular Data with Completion and Masking: This paper proposes a two-stage framework for generating regular time series from irregularly sampled data: (1) a TST autoencoder completes missing values to construct a "natural neighborhood," and (2) a masking strategy applied during visual diffusion model training computes loss only on observed pixels, avoiding over-reliance on completed values. The approach achieves an average 70% improvement in discriminative score and a 6.5× training speedup.
A Gradient Flow Approach to Solving Inverse Problems with Latent Diffusion Models: This paper proposes DWGF (Diffusion-regularized Wasserstein Gradient Flow), which rigorously formalizes posterior sampling with latent diffusion models as a regularized gradient flow of KL divergence in the Wasserstein-2 space. An ODE system in the latent space is derived to solve image inverse problems, achieving substantially higher PSNR than baselines on inpainting and super-resolution tasks on FFHQ-512.
Accelerating Parallel Diffusion Model Serving with Residual Compression: This paper proposes CompactFusion, a framework that eliminates communication redundancy in parallel diffusion inference via residual compression—transmitting only the activation differences between adjacent denoising steps rather than full activations. It achieves a 3.0× speedup on 4×L20 GPUs with generation quality significantly superior to DistriFusion, a 6.7× speedup under simulated Ethernet bandwidth, and maintains better quality than DistriFusion even at 100× compression.
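The residual idea in the CompactFusion entry can be sketched in a few lines: both sides keep the last reconstructed activation and only a compressed difference is transmitted. This is a minimal sketch under stated assumptions, not the paper's system: the `ResidualCodec` class, the per-tensor int8 quantizer used as the compressor, and the synthetic activations are all hypothetical.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization (an assumed compressor)."""
    scale = max(float(np.abs(x).max()), 1e-12) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * np.float32(scale)

class ResidualCodec:
    """Holds the last reconstructed activation; sender and receiver each keep one."""

    def __init__(self, shape):
        self.reference = np.zeros(shape, dtype=np.float32)

    def encode(self, activation):
        # Compress only the change relative to what the receiver already holds.
        q, scale = quantize_int8(activation - self.reference)
        # Track the receiver's reconstruction so quantization error does not accumulate.
        self.reference = self.reference + dequantize_int8(q, scale)
        return q, scale

    def decode(self, q, scale):
        self.reference = self.reference + dequantize_int8(q, scale)
        return self.reference

rng = np.random.default_rng(0)
step1 = rng.normal(size=1024).astype(np.float32)
# Adjacent denoising steps are assumed to produce similar activations.
step2 = step1 + 0.01 * rng.normal(size=1024).astype(np.float32)

sender, receiver = ResidualCodec(1024), ResidualCodec(1024)
for act in (step1, step2):
    recon = receiver.decode(*sender.encode(act))

residual_err = float(np.abs(recon - step2).mean())
direct_err = float(np.abs(dequantize_int8(*quantize_int8(step2)) - step2).mean())
print(f"residual: {residual_err:.6f}, direct: {direct_err:.6f}")
```

Because the residual has a much smaller dynamic range than the full activation, the same bit budget yields a smaller reconstruction error in this toy setting.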
AccuQuant: Simulating Multiple Denoising Steps for Quantizing Diffusion Models: This paper reveals an error accumulation phenomenon in diffusion model quantization—where quantization errors at each step propagate and amplify into subsequent steps—and proposes explicitly simulating consecutive multi-step denoising during PTQ calibration to jointly optimize quantization parameters, while reducing memory from O(n) to O(1) through a carefully designed objective function.
Adapting Speech Language Model to Singing Voice Synthesis: This paper adapts a 1.7B-parameter TTS-pretrained Speech Language Model to the Singing Voice Synthesis (SVS) task via score tokenization, multi-stream LM prediction, conditional flow matching refinement, and a vocoder. Using only 135 hours of synthesized singing data, the system achieves performance comparable to dedicated SVS systems.
ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering: This paper introduces ALE-Bench, the first AI benchmark targeting scored algorithm engineering contests (AtCoder Heuristic Contest). It curates 40 NP-hard optimization problems and provides an interactive agent evaluation framework. The strongest model, o3-high, achieves only human-average performance in a one-shot setting, with significant gaps between AI and human experts in cross-problem consistency and long-horizon iterative improvement.
Aligning Compound AI Systems via System-level DPO: This paper models compound AI systems as DAGs and proposes the SysDPO framework, which extends DPO to joint multi-component alignment. By leveraging DAG decomposition, system-level preferences are transformed into an end-to-end optimizable loss function. The authors provide theoretical guarantees of $\beta$-perfect alignment and demonstrate substantial improvements in collaborative quality on both LLM+diffusion model and LLM+LLM systems.
Aligning Text to Image in Diffusion Models is Easier Than You Think: This paper proposes SoftREPA — a lightweight contrastive fine-tuning strategy that introduces learnable soft text tokens (fewer than 1M parameters) to perform contrastive learning on frozen pretrained T2I diffusion models, explicitly maximizing mutual information between text and image representations. SoftREPA significantly improves text-image alignment on SD1.5/SDXL/SD3 and generalizes to both image generation and image editing tasks.
Amortized Sampling with Transferable Normalizing Flows: This work proposes Prose — a 285M-parameter all-atom transferable normalizing flow based on the TarFlow architecture, trained on 21,700 short-peptide MD trajectories (totaling 4.3 ms of simulation time). Prose enables zero-shot uncorrelated proposal sampling for arbitrary short-peptide systems, outperforms MD baselines under equal energy evaluation budgets, and generates samples 4,000× faster than the prior transferable Boltzmann generator (TBG).
AugGen: Synthetic Augmentation using Diffusion Models Can Improve Recognition: This paper proposes AugGen, a self-contained synthetic data augmentation method that trains a class-conditional diffusion model on the target dataset, generates new "mixed-class" samples by interpolating class conditioning vectors across different identities, and uses the resulting augmented data to improve discriminative model training. AugGen achieves 1–12% performance gains on face recognition benchmarks without relying on any external data or auxiliary models.
BADiff: Bandwidth Adaptive Diffusion Model: This paper proposes BADiff—the first bandwidth-adaptive diffusion model—which embeds target entropy constraints as explicit conditions into the diffusion reverse process, coupled with a differentiable entropy regularization loss and an adaptive stopping policy. The model dynamically adjusts generation quality according to real-time bandwidth and terminates sampling adaptively, reducing computational overhead while maintaining perceptual quality. This approach fundamentally avoids the compression artifacts and computational waste inherent in the conventional "high-quality generation → post-compression" pipeline.
Balanced Conic Rectified Flow: To address the distribution drift induced by the reflow step in k-rectified flow, this paper proposes conic reflow: constructing conic supervisory trajectories from the inverted noise of real images and their Slerp-perturbed neighbors, substantially reducing the number of required fake pairs while achieving superior generation quality and straighter ODE trajectories.
Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking: Prime (Partial masking scheme) represents each token as a base-$b$ sub-token sequence and independently masks at the sub-token level, introducing intermediate states into masked diffusion models to enable fine-grained denoising. On OpenWebText, it achieves a perplexity of 15.36, becoming the first MDM to surpass ARM (17.54) without relying on an autoregressive formulation.
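The base-$b$ sub-token representation in the Prime entry can be illustrated with a toy encoder. This is a sketch only: the `MASK` sentinel, the function names, and the uniform per-digit masking probability are assumptions, and the paper's actual noise schedule and model are not reproduced.

```python
import random

MASK = -1  # hypothetical sentinel for a masked sub-token

def to_subtokens(token_id, base, length):
    """Write a token id as `length` base-`base` digits, most significant first."""
    digits = []
    for _ in range(length):
        digits.append(token_id % base)
        token_id //= base
    return digits[::-1]

def from_subtokens(digits, base):
    token_id = 0
    for d in digits:
        token_id = token_id * base + d
    return token_id

def partially_mask(digits, mask_prob, rng):
    """Mask each sub-token independently, so a token can be partly revealed."""
    return [MASK if rng.random() < mask_prob else d for d in digits]

base, length = 4, 3  # covers a toy vocabulary of 4**3 = 64 tokens
subtokens = to_subtokens(27, base, length)  # 27 = 1*16 + 2*4 + 3
rng = random.Random(0)
noisy = partially_mask(subtokens, mask_prob=0.5, rng=rng)
print(subtokens, noisy)
```

A token with some digits masked and others visible is the kind of intermediate state that a plain masked/unmasked scheme cannot represent.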
BitMark: Watermarking Bitwise Autoregressive Image Generative Models: This paper proposes BitMark—the first watermarking scheme for bitwise autoregressive image generative models (Infinity, Instella). During generation, it steers bit sequences toward a "green list" by adding logit biases, enabling reliable detection (z-test), high image fidelity (negligible FID change), robustness against diverse attacks, and radioactivity (downstream models trained on watermarked images also carry the watermark), providing a critical tool for preventing model collapse.
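The z-test mentioned in the BitMark entry follows the usual green-list detection pattern: count how many scored positions fall in the green list and compare against the chance rate. The sketch below shows that generic statistic only; the function names, the chance rate `gamma`, and the decision `threshold` are assumptions, and how BitMark defines its green list over bit sequences is not modeled here.

```python
import math

def watermark_z_score(num_green, num_scored, gamma=0.5):
    """z-score of the green count under the no-watermark null hypothesis.

    Under the null, each scored position is green with probability `gamma`,
    so the count is binomial with mean gamma*n and variance n*gamma*(1-gamma).
    """
    expected = gamma * num_scored
    std = math.sqrt(num_scored * gamma * (1.0 - gamma))
    return (num_green - expected) / std

def is_watermarked(num_green, num_scored, gamma=0.5, threshold=4.0):
    # `threshold` is a hypothetical operating point, not a value from the paper.
    return watermark_z_score(num_green, num_scored, gamma) > threshold

z_clean = watermark_z_score(5000, 10000)   # exactly at chance level
z_marked = watermark_z_score(5400, 10000)  # 400 above chance, std = 50
print(z_clean, z_marked, is_watermarked(5400, 10000))
```

A small per-position bias becomes a large z-score once enough positions are scored, which is why a mild logit bias can still be detected reliably.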
Blameless Users in a Clean Room: Defining Copyright Protection for Generative Models: This paper reconstructs the theoretical foundations of provable copyright protection for generative models. It demonstrates that the existing Near Access-Freeness (NAF) definition fails to prevent verbatim reproduction ("tainted" models). It then proposes a "blameless user" framework and a clean-room copyright protection definition ($(\kappa,\beta)$-clean), under which users who would not reproduce content in a counterfactual "clean-room setting" are also unlikely to reproduce it in the real world. The paper further proves that differentially private training implies clean-room copyright protection under a "golden dataset" assumption.
Blind Strong Gravitational Lensing Inversion: Joint Inference of Source and Lens Mass with Score-Based Models: This work presents the first application of score-based generative model priors to blind strong gravitational lensing inversion — jointly inferring the morphology of background source galaxies and lens mass distribution parameters. By extending GibbsDDRM to the continuous-time domain, the method achieves reconstruction residuals consistent with observational noise and unbiased marginal posteriors over lens parameters.
BlurDM: A Blur Diffusion Model for Image Deblurring: BlurDM integrates the physical formation process of motion blur (progressive blur accumulation due to continuous exposure) into a diffusion model via a dual forward process (simultaneous noise addition and blurring) and a dual denoising-deblurring reverse process. It serves as a latent-space prior generator that consistently enhances four deblurring methods across four datasets, achieving an average gain of +0.31 dB on GoPro and +0.78 dB on RealBlur-J, while adding only ~4 GFLOPs and ~9 ms.
BlurGuard: A Simple Approach for Robustifying Image Protection Against AI-Powered Edit: This paper proposes BlurGuard—a method that applies mild blurring to an image prior to adversarial perturbation generation, causing the perturbation to couple with low-frequency structures and thereby resist post-processing operations such as JPEG compression and Gaussian noise. This approach more effectively prevents AI editing tools such as Stable Diffusion from tampering with protected images, achieving over 20% improvement in protection success rate compared to the non-blurred baseline.
BoltzNCE: Learning Likelihoods for Boltzmann Generation with Stochastic Interpolants: BoltzNCE trains an Energy-Based Model (EBM) via a hybrid Score Matching + InfoNCE objective to approximate the likelihood of a Boltzmann Generator, eliminating expensive Jacobian trace computations. On alanine dipeptide conformation generation, it achieves a 100× inference speedup with a free energy error of only 0.02 $k_BT$.
Boosting Generative Image Modeling via Joint Image-Feature Synthesis: This paper proposes ReDi (Representation Diffusion), a framework that jointly models VAE image latents and DINOv2 semantic features within a diffusion model — both are simultaneously denoised from pure noise within a single diffusion process. With minimal modifications to the DiT architecture, ReDi achieves a 23× training convergence speedup and state-of-the-art FID, while unlocking a novel Representation Guidance inference strategy.
Breaking AR's Sampling Bottleneck: Provable Acceleration via Diffusion Language Models: This paper establishes a complete convergence theory for masked diffusion language models from an information-theoretic perspective: it proves that the sampling error in KL divergence decays at an $O(1/T)$ rate and scales linearly with inter-token mutual information, provides a matching lower bound to establish tightness, and theoretically demonstrates that diffusion models can generate high-quality samples in $T < L$ steps (where $L$ is the sequence length).
CADMorph: Geometry-Driven Parametric CAD Editing via a Plan-Generate-Verify Loop: This paper proposes CADMorph, an iterative plan–generate–verify framework that leverages a pretrained Parameter-to-Shape (P2S) diffusion model and a Masked-Parameter-Prediction (MPP) large language model to achieve geometry-driven parametric CAD editing without requiring triplet training data.
CAMILA: Context-Aware Masking for Image Editing with Language Alignment: This paper proposes CAMILA, a context-aware image editing method that leverages a multimodal large language model (MLLM) to automatically determine whether a given instruction is executable on the input image. It introduces dedicated [MASK] and [NEG] tokens to distinguish editable regions from regions that should remain unchanged, enabling precise multi-instruction editing while effectively filtering out non-executable instructions.
CaMiT: A Time-Aware Car Model Dataset for Classification and Generation: This paper introduces the CaMiT dataset (787K labeled + 5.1M unlabeled car images, 2005–2023) to systematically study temporal drift in fine-grained visual categories, providing benchmarks across four settings: static pre-training, time-incremental pre-training, time-incremental classifier learning, and time-aware image generation.
Can Knowledge-Graph-based Retrieval Augmented Generation Really Retrieve What You Need?: This paper proposes GraphFlow, a framework that models retrieval over knowledge graphs as a flow-matching problem under GFlowNet, jointly training a retrieval policy and flow estimator via a detailed balance objective and local exploration strategy. On the STaRK benchmark, GraphFlow surpasses GPT-4o by approximately 10% in both retrieval accuracy and diversity.
CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices: CDFlow constructs invertible linear layers from alternating products of circulant and diagonal matrices. This reduces parameter complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$, matrix inversion complexity from $\mathcal{O}(n^3)$ to $\mathcal{O}(mn\log n)$, and log-determinant computation from $\mathcal{O}(n^3)$ to $\mathcal{O}(mn)$, and the method outperforms comparable methods on density estimation and periodic data modeling.
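The complexity savings in the CDFlow entry come from a standard property: a circulant matrix is diagonalized by the discrete Fourier transform, so products and inverses reduce to FFTs, and the log-determinant reduces to a sum over eigenvalues. The sketch below shows this for a single circulant-diagonal pair; the function names and the random test values are assumptions, and it is not the paper's layer implementation (in particular, it recomputes the FFT to obtain the eigenvalues rather than parameterizing them directly).

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply by the circulant matrix with first column c, via FFT."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def circulant_solve(c, y):
    """Invert the same circulant matrix by dividing in the Fourier domain."""
    return np.real(np.fft.ifft(np.fft.fft(y) / np.fft.fft(c)))

def circulant_logabsdet(c):
    """log|det C| is the sum of log-magnitudes of the eigenvalues fft(c)."""
    return float(np.sum(np.log(np.abs(np.fft.fft(c)))))

def dense_circulant(c):
    """Dense reference matrix C[i, j] = c[(i - j) mod n], used only for checking."""
    n = len(c)
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return c[idx]

rng = np.random.default_rng(0)
n = 8
c = rng.normal(size=n)
c[0] += 8.0  # hypothetical choice that makes a near-singular matrix unlikely
d = rng.uniform(0.5, 2.0, size=n)
x = rng.normal(size=n)

# One circulant-diagonal pair: y = C @ diag(d) @ x
y = circulant_matvec(c, d * x)
x_rec = circulant_solve(c, y) / d
logdet = circulant_logabsdet(c) + float(np.sum(np.log(np.abs(d))))

# Compare against the dense computation.
dense = dense_circulant(c) @ np.diag(d)
_, ref_logdet = np.linalg.slogdet(dense)
print(np.allclose(x_rec, x), abs(ref_logdet - logdet))
```

Stacking $m$ such pairs multiplies these per-pair costs by $m$, which matches the $\mathcal{O}(mn\log n)$ inversion cost quoted in the entry.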
Composite Flow Matching for Reinforcement Learning with Shifted-Dynamics Data: This paper proposes CompFlow, a composite flow matching architecture that builds an online flow on top of the offline flow's output distribution to estimate the dynamics shift (Wasserstein distance) between offline and online environments. Combined with an active exploration strategy targeting high-shift regions, CompFlow achieves an average return 14.2% above the strongest baseline across 27 shifted-dynamics RL tasks.
Composition and Alignment of Diffusion Models using Constrained Learning: This paper proposes a unified constrained optimization framework that formalizes reward alignment and multi-model composition of diffusion models as constrained optimization problems. By applying Lagrangian duality, the framework automatically determines optimal weights, eliminating the need for manual hyperparameter search.
Conditional Panoramic Image Generation via Masked Autoregressive Modeling: This paper proposes PAR (Panoramic AutoRegressive model), the first framework to unify text-to-panorama (T2P) and panorama outpainting (PO) under masked autoregressive modeling. PAR addresses the boundary discontinuity inherent in ERP panoramas through a circular translation consistency loss and dual-space circular padding, achieving an FID of 37.37 on Matterport3D while demonstrating strong scalability and zero-shot generalization.
Constrained Discrete Diffusion: This paper proposes CDD (Constrained Discrete Diffusion), which embeds a differentiable constrained optimization projection operator into the denoising process of discrete diffusion models. Without retraining, CDD enforces sequence-level constraints at sampling time, achieving zero constraint violations across three task categories: toxic text generation, molecular design, and instruction following.
Contextual Thompson Sampling via Generation of Missing Data: This paper proposes Generative Thompson Sampling (TS-Gen), which reframes uncertainty in contextual bandits as missing data rather than unknown parameters. A generative model autoregressively imputes missing outcomes to implement Thompson sampling, and a regret bound directly tied to offline prediction loss is established.
Continuous Diffusion Model for Language Modeling: This paper proposes RDLM (Riemannian Diffusion Language Model), which constructs a continuous diffusion process on a statistical manifold (hypersphere) to model discrete distributions. It establishes a theoretical connection between discrete diffusion and continuous flows, and leverages radial symmetry to enable simulation-free training and a dimension-splitting technique for handling large vocabularies. RDLM achieves 1.32 BPC on Text8, surpassing all discrete and continuous diffusion models.
Continuous Uniqueness and Novelty Metrics for Generative Modeling of Inorganic Crystals: This paper identifies four critical flaws in the widely adopted discrete distance function (StructureMatcher) used to evaluate inorganic crystal generative models, and proposes continuous distance functions based on Magpie fingerprints (composition) and AMD vectors (structure) to achieve more reliable uniqueness and novelty metrics.
CORAL: Disentangling Latent Representations in Long-Tailed Diffusion: This paper diagnoses the root cause of tail-class generation degradation in diffusion models trained on long-tailed data as representation entanglement in the U-Net bottleneck layer, and proposes CORAL, which applies a supervised contrastive loss at the bottleneck to disentangle class representations. CORAL consistently outperforms baselines including DDPM, CBDM, and T2H on CIFAR10/100-LT, CelebA-5, and ImageNet-LT.
CORAL: Disentangling Latent Representations in Long-Tailed Diffusion: This paper identifies a phenomenon termed "representation entanglement" in diffusion models trained on long-tailed data, wherein the latent representations at the U-Net bottleneck layer exhibit severe overlap between tail and head class feature spaces. To address this, the authors propose CORAL, which introduces a projection head and a supervised contrastive loss at the bottleneck layer to promote inter-class latent separation, substantially improving the generation quality and diversity of tail classes.
Co-Reinforcement Learning for Unified Multimodal Understanding and Generation: This paper proposes CoRL (Co-Reinforcement Learning), a two-stage framework — Unified RL followed by Refined RL — that simultaneously optimizes both understanding and generation capabilities of Unified Multimodal Language Models (ULMs) via reinforcement learning, achieving synergistic co-evolution of dual capabilities: +7% on generation and +23% on understanding at 1.5B parameters.
Counterfactual Identifiability via Dynamic Optimal Transport: This paper leverages dynamic optimal transport (dynamic OT) theory to resolve—for the first time—the counterfactual identifiability problem in high-dimensional multivariate Markovian SCMs. It proves that the OT flow mechanism yields a unique monotone order-preserving counterfactual transport map, and extends the results to non-Markovian settings (IV/BC/FC criteria).
Coupling Generative Modeling and an Autoencoder with the Causal Bridge: This paper couples a generative model with an autoencoder to improve estimation of the causal bridge function in the presence of unobserved confounders. A shared encoder pools statistical strength across treatment, control, and outcome variables, and the framework is extended to survival analysis.
Cross-fluctuation Phase Transitions Reveal Sampling Dynamics in Diffusion Models: Drawing on fluctuation theory from statistical physics, this work proposes a framework for detecting discrete phase transitions in the sampling process of diffusion models via cross-fluctuations, enabling accelerated sampling, improved conditional generation, zero-shot classification, and style transfer—all without retraining.
Decomate: Leveraging Generative Models for Co-Creative SVG Animation: This paper proposes Decomate, an interactive system that leverages multimodal large language models (MLLMs) to automatically decompose unstructured SVG graphics into semantic components. Designers specify animation behaviors for each component via natural language, and the system generates production-ready HTML/CSS/JS animation code, supporting iterative co-creative workflows.
DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models: This paper proposes DEFT (Decompositional Efficient Fine-Tuning), which efficiently fine-tunes T2I models by decomposing weight updates into two components — subspace projection and low-rank adjustment — outperforming LoRA and PaRa on both personalized and general image generation tasks.
Denoising Weak Lensing Mass Maps with Diffusion Model and Generative Adversarial Network: This work applies diffusion models (DM) to the task of weak gravitational lensing mass map denoising and conducts a systematic comparison with GAN (pix2pix) under identical experimental settings, demonstrating that DM comprehensively outperforms GAN in terms of training stability, robustness under multi-sample averaging, and reconstruction accuracy across multiple statistical estimators.
Detecting Generated Images by Fitting Natural Image Distributions: This paper proposes ConV, a consistency verification framework that exploits the geometric discrepancy between the natural image manifold and generated images. By constructing two gradient-orthogonal functions, ConV achieves training-free generated image detection. An enhanced variant, F-ConV, further amplifies manifold deviation via Normalizing Flows.
Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model: This paper proposes a unified three-stage workflow based on a fine-tuned geospatial foundation model (Granite-GFM): first establishing an empirical baseline via green space cooling effects to verify physical plausibility; then extrapolating urban temperatures under future climate scenarios; and finally simulating the cooling impact of greening interventions via inpainting. This elevates the foundation model from an evaluation tool to an interactive simulation platform for urban planning.
DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models: This paper proposes DEXTER, a data-free framework that optimizes textual prompts to drive a diffusion model to generate images maximizing target classifier activations, then employs an LLM to reason over the synthesized samples and produce globally coherent, human-readable textual explanations, enabling bias discovery and global interpretation of model behavior.
DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling: This paper finds that the global self-attention in pretrained DiTs primarily captures local patterns and thus exhibits substantial redundancy in generative tasks. It proposes DiCo, a purely convolutional diffusion model built from standard convolution modules and a Compact Channel Attention (CCA) mechanism. DiCo achieves an FID of 2.05 on ImageNet-256, surpassing DiT-XL/2, with 2.7× faster inference at 256 resolution and 3.1× faster at 512 resolution.
Diff-ICMH: Harmonizing Machine and Human Vision in Image Compression with Generative Prior: This paper proposes Diff-ICMH, a diffusion-based generative image compression framework that preserves semantic integrity via a Semantic Consistency (SC) loss and activates generative priors via a Tag Guidance Module (TGM). Using a single encoder-decoder and a single bitstream, the framework simultaneously serves 10+ machine intelligence tasks and human visual perception without any task-specific adaptation.
DiffEye: Diffusion-Based Continuous Eye-Tracking Data Generation Conditioned on Natural Images: This paper proposes DiffEye, the first diffusion-based framework that directly utilizes raw eye-tracking data to generate continuous and diverse eye movement trajectories conditioned on natural images, while introducing Corresponding Position Embedding (CPE) to align the gaze space with the image semantic space.
Diffusion-Based Electromagnetic Inverse Design of Scattering Structured Media: This paper proposes a conditional diffusion model-based framework for electromagnetic inverse design that directly generates dielectric-sphere metasurface geometries from target differential scattering cross sections (DSCS), bypassing costly iterative optimization. The approach naturally handles the non-uniqueness of the inverse problem and outperforms CMA-ES evolutionary optimization while being orders of magnitude faster.
Diffusion-Classifier Synergy: Reward-Aligned Learning via Mutual Boosting Loop for FSCIL: This paper proposes the Diffusion-Classifier Synergy (DCS) framework, which establishes a closed-loop mutual boosting cycle between a diffusion model and a classifier. A multi-level reward function (feature-level + logits-level) guides the diffusion model to generate images most beneficial to the classifier, achieving state-of-the-art performance on FSCIL benchmarks.
Diffusion-Driven Progressive Target Manipulation for Source-Free Domain Adaptation: This paper proposes the DPTM framework, which leverages a latent diffusion model to perform semantic transformation on unreliable target samples, generating a pseudo-target domain and iteratively narrowing the gap with the real target domain via a progressive reconstruction mechanism. DPTM achieves up to 18.6% improvement over existing SFDA state-of-the-art methods under large domain shift scenarios.
Diffusion Adaptive Text Embedding for Text-to-Image Diffusion Models: This paper proposes DATE (Diffusion Adaptive Text Embedding), which dynamically updates text embeddings during diffusion model sampling based on the current denoising intermediate results, improving text-image semantic alignment without any additional training.
Diffusion Classifiers Understand Compositionality, but Conditions Apply: This paper presents a comprehensive study of zero-shot diffusion classifiers on compositional understanding tasks, covering 3 diffusion models (SD 1.5/2.0/3-m) × 10 datasets × 30+ tasks. It introduces Self-Bench, a diagnostic benchmark that eliminates the domain gap by using images generated by the diffusion models themselves, and finds that diffusion classifiers do understand compositionality, but that performance depends on domain alignment and timestep weighting, hence "conditions apply."
Diffusion Generative Modeling on Lie Group Representations: This paper proposes a novel theoretical framework for constructing diffusion processes on the representation space of Lie groups (rather than on the Lie groups themselves). By mapping the curved dynamics of non-Abelian Lie groups into Euclidean space via generalized score matching, the framework enables simulation-free training of Lie group diffusion models, and demonstrates that standard score matching is a special case corresponding to the translation group.
Diffusion Models Meet Contextual Bandits: This paper proposes diffusion Thompson Sampling (dTS), which employs a pretrained diffusion model as an expressive prior over action parameters in contextual bandit problems. Through an efficient hierarchical posterior approximation, dTS enables fast updates and sampling, significantly outperforming conventional methods in large action spaces.
Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation: This paper proposes Distilled Decoding 2 (DD2), which reinterprets auto-regressive image models as conditional score models and designs a Conditional Score Distillation (CSD) loss to compress multi-step AR sampling into one-step generation. On ImageNet-256, DD2 achieves only a marginal FID degradation from 3.40 to 5.43 while obtaining 8.0× speedup (VAR) and 238× speedup (LlamaGen), closing 67% of the performance gap relative to DD1 and training 12.3× faster.
DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution: This paper presents DOVE, a video super-resolution model built upon the CogVideoX pretrained video generation model. Through a two-stage latent-pixel space training strategy and a curated high-quality dataset (HQ-VSR), DOVE achieves single-step inference for video super-resolution, delivering a 28× speedup over multi-step diffusion methods with comparable or superior performance.
Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable: This paper proposes Dual Data Alignment (DDA), which generates synthetic training images via pixel-domain and frequency-domain dual alignment to eliminate spurious correlations caused by dataset bias. By forcing the detector to learn only forgery-relevant features, DDA achieves an average accuracy of 90.7% across 11 benchmarks, substantially outperforming existing methods.
EditInfinity: Image Editing with Binary-Quantized Generative Models: This paper proposes EditInfinity, the first work to apply the classical "image inversion–image editing" paradigm to binary-quantized autoregressive generative models (Infinity). By leveraging the inherent property of quantized representations that enables exact intermediate supervision, EditInfinity achieves high-fidelity image inversion. Combined with a piecewise linear smoothing kernel for seamless editing, it comprehensively surpasses diffusion model baselines on PIE-Bench.
EEGReXferNet: A Lightweight Gen-AI Framework for EEG Subspace Reconstruction via Cross-Subject Transfer Learning and Channel-Aware Embedding: This paper proposes EEGReXferNet, a lightweight generative AI framework that achieves EEG subspace reconstruction under a cross-subject transfer learning setting via neighborhood channel-aware input selection, band-specific sub-window convolutional encoding/decoding, a dynamic sliding-window latent space, and reference statistics scaling. The framework reduces parameter count by approximately 45% and achieves inference latency <1ms, while maintaining PSD correlation $\geq 0.95$ and spectrogram RV coefficient $\geq 0.85$.
Efficient Rectified Flow for Image Fusion: This paper proposes RFfusion, which introduces Rectified Flow into image fusion for the first time, enabling training-free one-step sampling. A two-stage fusion-oriented VAE training strategy is further designed, achieving comprehensive superiority over existing diffusion-based fusion methods in both speed and quality.
Elucidated Rolling Diffusion Models for Probabilistic Forecasting of Complex Dynamics: This paper proposes ERDM, the first framework to successfully unify the Rolling Diffusion paradigm with the principled design choices of EDM (noise schedule, preconditioning, Heun sampler). By employing a progressive noise schedule that explicitly models growing uncertainty, ERDM significantly outperforms autoregressive EDM baselines on Navier-Stokes and ERA5 weather forecasting benchmarks.
Emergence and Evolution of Interpretable Concepts in Diffusion Models: This work is the first to systematically apply Sparse Autoencoders (SAEs) to multi-step diffusion models (Stable Diffusion v1.4), revealing that image composition emerges as early as the first reverse diffusion step while stylistic concepts form during intermediate stages. Based on these findings, the paper proposes temporally adaptive causal intervention techniques.
Encoder-Decoder Diffusion Language Models for Efficient Training and Inference: This paper proposes E2D2, an encoder-decoder architecture for discrete diffusion language models that performs iterative denoising via a lightweight decoder while periodically updating representations through a large encoder, achieving faster inference (~3× vs. MDLM) and more efficient block diffusion training (halving FLOPs).
Energy Loss Functions for Physical Systems: This paper proposes a physics-based energy loss function framework. By deriving an energy-difference loss grounded in pairwise distances via reverse KL divergence and the Boltzmann distribution, the framework naturally satisfies SE(d) invariance and substantially outperforms MSE and cross-entropy losses on molecular generation and spin ground-state prediction tasks.
Enhancing Diffusion Model Guidance through Calibration and Regularization: To address the vanishing gradient problem caused by overconfident classifiers in classifier-guided diffusion models, this paper proposes two complementary approaches: (1) a Smooth ECE calibration loss for fine-tuning classifiers, yielding ~3% FID improvement; and (2) regularized sampling guidance based on f-divergences (RKL/FKL/JS) that requires no retraining, achieving FID 2.13 on ImageNet 128×128.
Entropy Rectifying Guidance for Diffusion and Flow Models: This paper proposes Entropy Rectifying Guidance (ERG), which manipulates the Hopfield energy landscape of attention layers (via temperature scaling and step-size adjustment) to obtain a weak prediction signal as a substitute for the unconditional prediction in conventional CFG, simultaneously improving quality, diversity, and consistency in text-to-image, class-conditional, and unconditional generation.
Epistemic Uncertainty for Generated Image Detection: This paper proposes WePe (Weight Perturbation), which estimates epistemic uncertainty by applying weight perturbations to a pretrained vision foundation model (DINOv2). The method exploits the divergence between natural and AI-generated images in uncertainty space for detection, requiring no training.
Equivariant Flow Matching for Symmetry-Breaking Bifurcation Problems: This paper proposes an equivariant flow matching framework combined with a symmetric coupling strategy to model multimodal probability distributions arising in symmetry-breaking bifurcation problems via generative AI, significantly outperforming deterministic models and VAEs on physical systems (buckling beam, Allen-Cahn equation).
Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation: This paper systematically evaluates 12 text-image compositional alignment metrics against human judgments, finding that no single metric consistently outperforms all others across compositional categories, that VQA metrics are not always superior, and that embedding-based metrics (ImageReward, HPS) are stronger on certain categories.
EVODiff: Entropy-aware Variance Optimized Diffusion Inference: This paper analyzes the inference process of diffusion models from an information-theoretic perspective and proposes EVODiff, a method that reduces conditional entropy by optimizing conditional variance, achieving significant sampling acceleration and quality improvement without modifying the underlying model.
Evolve to Inspire: Novelty Search for Diverse Image Generation: This paper proposes Wander, a framework that leverages novelty search and LLM-driven prompt evolution to generate highly diverse image collections from a single text prompt, surpassing existing evolutionary prompt optimization baselines on the Vendi Score metric.
Exploring Semantic-constrained Adversarial Example with Instruction Uncertainty Reduction: This paper proposes InSUR, a multi-dimensional instruction uncertainty reduction framework that stabilizes adversarial optimization via a ResAdv-DDIM sampler, constrains attack scenarios through context-aware encoding, and evaluates semantic fidelity using WordNet-based semantic abstraction. InSUR is the first method to generate 2D/3D semantic-constrained adversarial examples (SemanticAE) from natural language instructions.
Exploring Variational Graph Autoencoders for Distribution Grid Data Generation: This paper systematically evaluates four variational graph autoencoder (VGAE) decoder architectures for synthesizing distribution grid topologies. The Iterative-GCN decoder is found to adequately reproduce structural and spectral characteristics of real grids on small, homogeneous datasets; however, on large, heterogeneous datasets, all methods exhibit critical failure modes including disconnected components and repetitive patterns.
Failure Prediction at Runtime for Generative Robot Policies: This paper proposes FIPER, a framework for runtime failure prediction in generative robot policies (diffusion/flow matching). It jointly evaluates an observation-side metric RND-OE (OOD detection) and an action-side metric ACE (Action Chunk Entropy) to enable early and accurate failure prediction without any failure data, with statistical guarantees provided via conformal prediction.
FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models: This paper proposes FairImagen, a post-processing debiasing framework that applies FairPCA projection in the CLIP prompt embedding space to remove demographic information, combined with empirical noise injection and joint cross-demographic debiasing, achieving significant fairness improvements in text-to-image generation without retraining the model.
FALCON: Few-step Accurate Likelihoods for Continuous Flows: This paper proposes FALCON, which employs a hybrid training objective (flow matching + mean velocity loss + invertibility regularization) to enable continuous normalizing flows to provide sufficiently accurate likelihood estimates under few-step sampling, achieving Boltzmann sampling two orders of magnitude faster than conventional CNFs.
Fast Data Attribution for Text-to-Image Models: This work distills the accurate but computationally expensive Attribution by Unlearning (AbU) method into a lightweight feature embedding space. By training via learning-to-rank, simple cosine similarity retrieval approximates the costly attribution ranking, enabling millisecond-level data attribution at the scale of Stable Diffusion + LAION-400M for the first time.
Fast Solvers for Discrete Diffusion Models: Theory and Applications of High-Order Algorithms: This work introduces high-order numerical methods into discrete diffusion model inference for the first time, proposing two second-order solvers — θ-RK-2 and θ-Trapezoidal — and theoretically proving that θ-Trapezoidal improves the discretization error from first-order $\mathcal{O}(\kappa T)$ to second-order $\mathcal{O}(\kappa^2 T)$. Experiments spanning 200M–8B models consistently demonstrate improvements across text, image, and mathematical reasoning tasks.
FerretNet: Efficient Synthetic Image Detection via Local Pixel Dependencies: Grounded in Markov Random Field (MRF) theory, this paper proposes a Local Pixel Dependency (LPD) feature representation that exposes textural inconsistencies in generated images via median-filter reconstruction. Combined with FerretNet, a lightweight convolutional network with only 1.1M parameters, the approach achieves an average detection accuracy of 97.1% across 22 generative models while being trained exclusively on 4 categories of ProGAN data.
Flatten Graphs as Sequences: Transformers are Scalable Graph Generators: This paper proposes AutoGraph, which losslessly flattens graphs into token sequences via Segmented Eulerian Neighborhood Trails (SENT), enabling direct modeling with a decoder-only Transformer. AutoGraph achieves graph generation speeds ~100× faster than diffusion models while reaching state-of-the-art performance on both synthetic and molecular benchmarks.
Flattening Hierarchies with Policy Bootstrapping: This paper proposes SAW (Subgoal Advantage-Weighted Policy Bootstrapping), which distills the long-horizon reasoning advantages of hierarchical RL into a single flat policy by sampling subgoals from in-dataset trajectories and performing policy bootstrapping via advantage-weighted importance sampling. The approach requires no learned subgoal generative model, and matches or surpasses state-of-the-art performance across 20 offline GCRL datasets.
Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators: This paper proposes Flex-Judge, which fine-tunes a multimodal large language model on only 1K text-only reasoning samples to achieve zero-shot generalization across image, video, audio, and molecular evaluation tasks, matching or surpassing commercial APIs such as GPT-4o and specialized evaluators trained on large-scale annotated data.
Flow Matching Neural Processes: This paper proposes FlowNP, which integrates flow matching into the neural process framework. By employing a transformer to predict velocity fields at target points, FlowNP enables parallel sampling from conditional distributions, achieving state-of-the-art performance on three benchmarks spanning 1D Gaussian processes, image data, and meteorological data.
FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks: This paper proposes FocalCodec—a low-bitrate speech codec based on Focal Modulation Networks—that compresses speech to 0.16–0.65 kbps using a single binary codebook, achieving performance comparable to or better than multi-codebook state-of-the-art methods on speech resynthesis, voice conversion, and multiple downstream tasks.
FreqPolicy: Efficient Flow-based Visuomotor Policy via Frequency Consistency: This work is the first to introduce frequency-domain consistency constraints into flow-based visuomotor policies. By projecting action chunk velocity fields into the frequency domain via DCT and imposing an adaptive frequency component loss, it achieves high-quality one-step action generation at 93.5 Hz, outperforming existing one-step generation methods on both simulation and real-robot tasks.
From Cradle to Cane: A Two-Pass Framework for High-Fidelity Lifespan Face Aging: This paper proposes Cradle2Cane, a two-pass face aging framework: the first pass achieves precise age control via Adaptive Noise Injection (AdaNI), and the second pass reinforces identity consistency through dual identity embeddings (IDEmb) comprising SVR-ArcFace and Rotate-CLIP. The framework achieves an optimal balance between age accuracy and identity preservation across the full lifespan (0–80 years).
GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data: GeneMAN proposes a generalizable single-image 3D human reconstruction framework that requires no parametric body model (e.g., SMPL). By training human-specific 2D/3D diffusion prior models on large-scale multi-source human data, and combining a geometry initialization-sculpting pipeline with multi-space texture refinement, GeneMAN achieves high-fidelity 3D human reconstruction from in-the-wild images, handling diverse body proportions, complex poses, and personal belongings.
Generative Model Inversion Through the Lens of the Manifold Hypothesis: This paper reveals, from a manifold-geometric perspective, that the essence of generative model inversion attacks (MIA) is implicit denoising achieved by projecting loss gradients onto the generator's tangent space. It proposes the gradient-manifold alignment hypothesis (higher alignment → greater vulnerability), and designs a training-free method, AlignMI, that consistently and significantly improves upon multiple state-of-the-art attacks.
GenIR: Generative Visual Feedback for Mental Image Retrieval: This paper proposes GenIR, a multi-round interactive image retrieval framework that leverages text-to-image diffusion models to generate "synthetic visual feedback," explicitly visualizing the system's interpretation of the user's query. This enables users to intuitively identify discrepancies and iteratively refine their queries, achieving substantial improvements over text-only feedback methods on the Mental Image Retrieval (MIR) task.
GeoRemover: Removing Objects and Their Causal Visual Artifacts: GeoRemover is a geometry-aware two-stage framework that decouples object removal into geometric removal (depth domain) and appearance rendering (RGB domain). By modifying the scene's geometric representation, it implicitly eliminates causal visual artifacts—such as shadows and reflections—left by the removed object.
Gradient Variance Reveals Failure Modes in Flow-Based Generative Models: By analyzing the gradient variance of the CFM loss, this paper demonstrates that Rectified Flow inevitably memorizes training pairs under deterministic interpolation rather than learning an optimal transport map, and proves that introducing stochasticity (stochastic interpolants) breaks this memorization channel and restores generalization.
GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning: This paper proposes GraLoRA, which partitions the LoRA weight update matrix into $k^2$ independent sub-blocks, each equipped with its own low-rank adapter pair. Without increasing parameter count or computational cost, GraLoRA raises the effective rank from $r$ to $kr$, addressing the performance degradation caused by gradient entanglement in high-rank LoRA. On code generation, Pass@1 improves by up to 8.5%.
Graph-based Neural Space Weather Forecasting: This paper proposes a graph neural network-based neural emulator for space weather, trained on Vlasiator hybrid-Vlasov simulation data, enabling both deterministic and probabilistic autoregressive forecasting of near-Earth space conditions. The emulator achieves over 100× speedup relative to the original simulator and quantifies forecast uncertainty through latent-variable ensemble generation.
Graph Diffusion that can Insert and Delete: This paper proposes GrIDDD, the first model to extend discrete denoising diffusion probabilistic models (DDPM) to support dynamic insertion and deletion of graph nodes during generation, allowing molecular graph size to adapt throughout the diffusion process. GrIDDD matches or surpasses existing methods on property targeting and molecular optimization tasks.
Graph Distance as Surprise: Free Energy Minimization in Knowledge Graph Reasoning: This paper establishes a formal connection between the Free Energy Principle (FEP) from neuroscience and knowledge graph reasoning. It proposes using shortest-path graph distance as a measure of surprise, generalizing the tree-structured surprise theory of Murphy et al. to arbitrary directed graphs, and provides a principled theoretical framework for entity grounding in KG-based agents.
GSPN-2: Efficient Parallel Sequence Modeling: GSPN-2 achieves up to 40× speedup over GSPN-1's 2D spatial propagation through algorithm-system co-design — specifically, single-kernel fusion, compact channel propagation, and shared memory optimization — while matching Transformer-level accuracy on ImageNet classification and text-to-image generation at significantly lower computational cost.
Guided Diffusion Sampling on Function Spaces with Applications to PDEs: This paper proposes FunDPS (Function-space Diffusion Posterior Sampling), which trains an unconditional diffusion model in function space and performs plug-and-play posterior sampling for PDE inverse problems via gradient guidance at inference time. Theoretically, it extends the Tweedie formula to infinite-dimensional Banach spaces. Empirically, across 5 PDE tasks with only 3% observations, FunDPS achieves 32% higher accuracy on average than DiffusionPDE while reducing the number of sampling steps by 4×.
GuideFlow3D: Optimization-Guided Rectified Flow For Appearance Transfer: This paper proposes GuideFlow3D, a training-free 3D appearance transfer framework that alternately injects differentiable guidance losses (part-aware appearance loss + self-similarity loss) into the sampling process of a pretrained rectified flow model, enabling robust texture and geometric detail transfer between objects with significant geometric discrepancies.
Hallucination as an Upper Bound: A New Perspective on Text-to-Image Evaluation: This paper proposes a definition of hallucination in text-to-image (T2I) models as bias-driven deviation, establishes a taxonomy of three hallucination categories—attribute, relation, and object—and argues that hallucination evaluation serves as an "upper bound" for prompt alignment evaluation, thereby revealing hidden model biases.
Head Pursuit: Probing Attention Specialization in Multimodal Transformers: This paper reinterprets the classical sparse signal recovery algorithm (SOMP) as a multi-sample interpretability tool, revealing fine-grained semantic specialization of attention heads in LLMs and VLMs. By flipping approximately 1% of heads, specific concepts (e.g., country names, toxic content, colors) can be reliably suppressed or amplified during generation.
Hephaestus: Mixture Generative Modeling with Energy Guidance for Large-scale QoS Degradation: This paper proposes Hephaestus, a three-stage generative framework (Forge-Morph-Refine) that combines a Predictive Path-Stressing (PPS) algorithm, an energy-guided mixture CVAE, and latent-space reinforcement learning optimization to address large-scale network QoS degradation problems.
Hierarchical Koopman Diffusion: Fast Generation with Interpretable Diffusion Trajectory: Grounded in Koopman operator theory, this work lifts the nonlinear denoising dynamics of diffusion models into a linear Koopman space, enabling one-step sampling through hierarchical decomposition while preserving the interpretability and controllability of intermediate generation states.
High-order Equivariant Flow Matching for Density Functional Theory Hamiltonian Prediction: This paper proposes QHFlow, the first method to apply conditional flow matching to density functional theory (DFT) Hamiltonian matrix prediction. By designing high-order SE(3)-equivariant vector fields and symmetry-aware prior distributions, QHFlow reduces Hamiltonian prediction error by 73% on MD17 and accelerates DFT computation by 54% when used as an SCF initializer.
Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval: This paper proposes Promptable Embeddings, a method that highlights target visual attributes at retrieval time to improve attribute-focused text-to-image retrieval, and introduces the COCO-Facet benchmark dataset.
HollowFlow: Efficient Sample Likelihood Evaluation using Hollow Message Passing: This paper proposes HollowFlow, a framework that enforces a block-diagonal structure on the Jacobian of the velocity field via Non-Backtracking Graph Neural Networks (NoBGNN) and Hollow Message Passing, reducing the number of backward passes required for likelihood computation in Continuous Normalizing Flows from $\mathcal{O}(n)$ to $\mathcal{O}(d)$, a count that stays constant as $n$ grows, and achieving up to $10^2\times$ sampling speedup.
How to Build a Consistency Model: Learning Flow Maps via Self-Distillation: This paper proposes a unified self-distillation framework for directly learning flow maps (the generalized form of consistency models). By exploiting the tangent condition, any distillation scheme is converted into a direct training algorithm that requires no pretrained teacher. Three algorithm families are derived (Eulerian / Lagrangian / Progressive), among which the Lagrangian method avoids spatial gradients and bootstrapping, achieving the most stable training and best performance.
Image Super-Resolution with Guarantees via Conformalized Generative Models: This work applies Conformal Prediction to construct binary "confidence masks" for generative image super-resolution models, reliably identifying trustworthy regions in generated images with rigorous statistical guarantees.
ImageSentinel: Protecting Visual Datasets from Unauthorized Retrieval-Augmented Image Generation: This paper proposes the ImageSentinel framework, which synthesizes sentinel images that are visually consistent with a private dataset and binds them to randomly generated character retrieval keys, enabling reliable detection of unauthorized use of private datasets by retrieval-augmented image generation (RAIG) systems—achieving near-100% AUC with only 3–10 queries.
Improved Training Technique for Shortcut Models (iSM): Targeting five key performance bottlenecks of Shortcut Models (compounding guidance, fixed guidance, frequency bias, self-consistency deviation, and curved trajectories), this paper proposes iSM, a unified training framework that incorporates intrinsic guidance, multi-level wavelet loss, scaling optimal transport, and twin EMA strategy, achieving substantial improvements on ImageNet 256×256 with one-step FID 5.27 and four-step FID 2.05.
Improving Posterior Inference of Galaxy Properties with Image-Based Conditional Flow Matching: This paper proposes a Conditional Flow Matching (CFM) framework that jointly models morphological information from galaxy images alongside photometric data, substantially improving posterior inference of physical galaxy properties including stellar mass, star formation rate, metallicity, and dust extinction.
ICEdit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer: ICEdit proposes an in-context editing paradigm built upon large-scale Diffusion Transformers (DiT), achieving state-of-the-art editing performance with only 0.1% of the training data used by prior methods through an in-context prompt design, lightweight LoRA-MoE fine-tuning, and VLM-guided early-filter inference-time scaling.
Increasing the Utility of Synthetic Images through Chamfer Guidance: This paper proposes Chamfer Guidance — a training-free inference-time guidance method that uses a small number of real samples as references. By leveraging Chamfer distance, it simultaneously optimizes fidelity and diversity of synthetic images. On ImageNet-1k, using only 32 real images, it achieves 97.5% Precision and 92.7% Coverage, and delivers up to 16% accuracy improvement in downstream classifier training.
Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing: This paper proposes inference-time scaling methods for flow models: stochasticity is introduced via ODE→SDE conversion to enable particle sampling; the search space is expanded through linear→VP interpolant conversion; and a Rollover Budget Forcing (RBF) strategy is designed to adaptively allocate the computational budget. The approach substantially outperforms all existing methods on compositional text-to-image generation and quantity-aware generation tasks.
InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation: This paper proposes InfinityStar, the first purely discrete autoregressive model capable of generating industrial-grade 720p video. Through spacetime pyramid modeling, it unifies T2I/T2V/I2V/interactive long video generation, achieving a VBench score of 83.74 that surpasses HunyuanVideo, with inference speeds 10–32× faster than diffusion models.
Information-Theoretic Discrete Diffusion: This work generalizes the classical I-MMSE identity from continuous diffusion to the discrete domain, establishing the I-MDSE and I-MDCE relations. It proves that DSE/DCE losses are not merely variational upper bounds but exact decompositions of the log-likelihood, and derives time-free formulas, conditional likelihood estimators, and coupled likelihood-ratio estimators. The proposed methods are validated on large-scale models such as LLaDA, demonstrating low variance and out-of-distribution detection capability.
Information Theoretic Learning for Diffusion Models with Warm Start: This paper proposes a likelihood estimation framework that generalizes the classical KL divergence–Fisher information relationship to arbitrary isotropic noise perturbations, combined with warm-start noise injection and importance sampling to eliminate the train-test gap and achieve tighter likelihood upper bounds, attaining state-of-the-art NLL on ImageNet at multiple resolutions.
Instance-Level Composed Image Retrieval: This paper proposes the instance-level composed image retrieval (i-CIR) benchmark and a training-free method, BASIC, which independently estimates image and text query similarities and fuses them via multiplicative combination, achieving state-of-the-art performance on both i-CIR and existing CIR datasets without any training.
Is Artificial Intelligence Generated Image Detection a Solved Problem?: This paper proposes AIGIBench, a comprehensive benchmark that systematically evaluates 11 state-of-the-art detectors across four tasks: multi-source generalization, multi-degradation robustness, data augmentation sensitivity, and test-time preprocessing impact. The results show that existing AIGI detection methods suffer severe performance degradation in real-world scenarios, so the problem is far from solved.
ItDPDM: Information-Theoretic Discrete Poisson Diffusion Model: This paper proposes ItDPDM (Information-Theoretic Discrete Poisson Diffusion Model), which achieves exact likelihood estimation for non-negative discrete data via a Poisson noise channel and a Poisson Reconstruction Loss (PRL), eliminating ELBO approximation and dequantization. The model outperforms existing discrete diffusion models in likelihood estimation on synthetic data, CIFAR-10, and MIDI music.
Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning: This paper proposes Janus-Pro-R1, which achieves synergistic advancement in visual understanding and generation through a two-stage training pipeline (SFT + RL). The approach enables MLLMs to form genuine Chain-of-Thought reasoning and trigger Aha Moments during text-to-image generation, surpassing GPT-4o on GenEval while extending naturally to image editing tasks.
KLASS: KL-Guided Fast Inference in Masked Diffusion Models: This paper proposes KLASS (KL-Adaptive Stability Sampling), a training-free sampling method that leverages token-level KL divergence and confidence scores to identify stable tokens for parallel decoding, achieving up to 2.78× speedup on masked diffusion models without sacrificing—and in many cases improving—generation quality.
Knowledge Distillation Detection for Open-weights Models: This paper introduces the task of knowledge distillation detection, proposing a data-free input synthesis and statistical scoring framework to determine whether an open-weights student model has been distilled from a specific teacher model.
Kuramoto Orientation Diffusion Models: This work introduces Kuramoto synchronization dynamics from biological systems into score-based generative models, constructing a forward synchronization / reverse desynchronization diffusion framework over the periodic domain. The proposed approach achieves substantially better generation quality than standard diffusion models on orientation-dense data such as fingerprints and textures, while remaining competitive on CIFAR-10.
Large-Scale Training Data Attribution for Music Generative Models via Unlearning: This paper applies machine unlearning-based training data attribution (TDA) to a large-scale text-to-music diffusion model (115K tracks), identifying optimal hyperparameter configurations via grid search and comparing against non-counterfactual methods, thereby demonstrating the feasibility of unlearning-based TDA in the music generation domain.
Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification: This paper proposes the Latent Zoning Network (LZN), a framework that unifies generative modeling, representation learning, and classification within a shared Gaussian latent space. Each data type is equipped with an encoder-decoder pair that maps samples to disjoint latent zones. Only two atomic operations (latent computation and latent alignment) are required to support diverse ML tasks. LZN reduces unconditional generation FID on CIFAR-10 from 2.76 to 2.59 and surpasses SimCLR on ImageNet linear classification.
LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching: This paper proposes LeapFactual, a counterfactual explanation algorithm based on Conditional Flow Matching (CFM), which bridges flattened and structured latent spaces via a "Lift-Land" (Leap) mechanism to generate reliable, in-distribution counterfactual samples that remain effective even when the learned decision boundary deviates from the true boundary.
Learnable Sampler Distillation for Discrete Diffusion Models: This paper proposes LSD and LSD+, which distill the intermediate score trajectory knowledge of a high-fidelity teacher sampler into a few-step student sampler via learnable sampling coefficients and non-uniform time scheduling, enabling efficient and high-quality sampling for discrete diffusion models.
Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders: This paper proposes a framework for extracting interpretable features from the latent spaces of audio generative models via sparse autoencoders (SAEs). Linear probes are used to map SAE features to human-understandable acoustic concepts (pitch, amplitude, timbre), enabling controllable manipulation and visualization of the audio generation process.
Learning to Integrate Diffusion ODEs by Averaging the Derivatives: This paper proposes the Secant Losses family, which learns to integrate diffusion ODEs via Monte Carlo integration and Picard iteration, progressively extending the tangent of a diffusion model into a secant. The approach achieves an excellent balance between training stability and few-step inference.
Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials: This paper proposes Visual-Contrast Attention (VCA), which generates compact positive/negative visual-contrast tokens via spatial pooling and performs differential interaction, reducing self-attention complexity from $O(N^2C)$ to $O(NnC)$ ($n \ll N$), while achieving consistent improvements on both image classification and generation tasks.
LinEAS: End-to-end Learning of Activation Steering with a Distributional Loss: This paper proposes LinEAS (Linear End-to-end Activation Steering), which jointly optimizes cross-layer affine transformations in an end-to-end manner using a 1D Wasserstein distributional loss for global activation alignment. With only 32 unpaired samples, LinEAS efficiently steers LLM toxicity and controls concept generation in T2I models.
LLM Meets Diffusion: A Hybrid Framework for Crystal Material Generation: This paper proposes CrysLLMGen, a hybrid framework that combines the complementary strengths of LLMs in discrete atom type prediction and diffusion models in continuous coordinate/lattice parameter modeling, achieving high structural validity and compositional validity simultaneously in crystal material generation.
MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction: This paper proposes MGE-LDM, the first model to simultaneously achieve music mixture generation, partial generation (source completion), and text-driven arbitrary source extraction within a unified latent diffusion framework. It jointly models mixture–submixture–source triplets and leverages diffusion inpainting to handle each task.
Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation: This paper proposes a framework for decoupling semantic and visual features from a pretrained diffusion model backbone to enable visual correspondence matching. Building on this, it introduces the Visual Semantic Matching (VSM) metric, which for the first time simultaneously supports quantification and spatial localization of visual inconsistencies in subject-driven image generation.
Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models: This paper proposes Modality-Decoupled Experts (MoDE), which decouples text and image adapters into independent T-MoE and V-Adapter subspaces, combined with knowledge distillation, to simultaneously mitigate intra-modal and inter-modal forgetting in continual instruction tuning of unified multimodal generation models.
Mitigating Sexual Content Generation via Embedding Distortion in Text-conditioned Diffusion Models: This paper proposes Distorting Embedding Space (DES), a text-encoder-based defense framework that achieves state-of-the-art sexual content mitigation on FLUX.1 and SD v1.5 (reducing the attack success rate, ASR, to 9.47% and 0.52%, respectively) by transforming unsafe embeddings into safe regions, preserving safe embeddings, and neutralizing "nudity" semantics, while maintaining high-quality benign image generation.
MMaDA: Multimodal Large Diffusion Language Models: This paper presents MMaDA, the first multimodal foundation model that simultaneously achieves text reasoning, multimodal understanding, and text-to-image generation within a unified discrete diffusion architecture. MMaDA bridges the gap between diffusion model pre-training and post-training through mixed long chain-of-thought (CoT) fine-tuning and the UniGRPO reinforcement learning algorithm.
MMG: Mutual Information Estimation via the MMSE Gap in Diffusion: Leveraging the information-theoretic formulation of diffusion models, this paper proves that mutual information equals one-half of the integral over all signal-to-noise ratios of the gap between conditional and unconditional denoising MMSE. The proposed MMG estimator, combined with adaptive importance sampling and the orthogonality principle, significantly improves estimation accuracy and stability.
MGAudio: Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation: This paper proposes MGAudio, the first video-to-audio generation framework that replaces classifier-free guidance (CFG) with model-guided (MG) training, combined with a dual-role audio-video encoder (DRAVE) for simultaneous condition injection and feature alignment. With only 131M parameters, MGAudio achieves state-of-the-art performance on VGGSound (FAD=0.40) and surpasses most competing methods using only 10% of the training data.
Modeling Microenvironment Trajectories on Spatial Transcriptomics with NicheFlow: NicheFlow is a Flow Matching-based generative model that represents cellular microenvironments as point clouds and jointly models the temporal evolution of cell states and spatial coordinates via Variational Flow Matching and optimal transport, substantially outperforming single-cell-level trajectory inference methods on embryonic development, brain development, and aging datasets.
Moment- and Power-Spectrum-Based Gaussianity Regularization for Text-to-Image Models: This paper proposes a unified Gaussianity regularization framework that combines moment matching in the spatial domain with power spectrum matching in the frequency domain. It subsumes existing regularizers (KL divergence, kurtosis, norm) as special cases, and achieves the equivalent effect of PRNO's $\mathcal{O}(D^2)$ approach at $\mathcal{O}(D\log D)$ complexity, significantly outperforming all baselines on reward alignment tasks for text-to-image models.
Multimodal Generative Flows for LHC Jets: This paper proposes a Transformer-based multimodal flow matching framework (MMF) that jointly models continuous flow matching and continuous-time Markov jump bridges, enabling unified generation of particle kinematics (continuous) and flavor quantum numbers (discrete) in LHC jets.
Neural Entropy: This paper explores the connection between deep learning and information theory through the lens of diffusion models, introducing a "neural entropy" measure to quantify the amount of information stored in neural networks during the diffusion process, revealing that image diffusion models achieve remarkably high compression efficiency on structured data.
Next Semantic Scale Prediction via Hierarchical Diffusion Language Models: This paper proposes HDLM (Hierarchical Diffusion Language Model), which introduces cluster tokens with coarse-grained semantics as an intermediate hierarchy between clean tokens and mask tokens, enabling "next semantic scale prediction" in discrete diffusion language modeling. The method derives a closed-form ELBO, achieves consistently lower perplexity than MDLM/GIDD on OpenWebText, and reduces generation perplexity by 62% after stochastic perturbation.
Non-Asymptotic Analysis of Data Augmentation for Precision Matrix Estimation: This paper provides a non-asymptotic analysis of data augmentation (DA) for high-dimensional precision matrix (inverse covariance matrix) estimation. It establishes quadratic error concentration bounds for both linear shrinkage estimators and DA estimators, and introduces a novel deterministic equivalent framework for generalized resolvent matrices with dependent structure.
Non-Markovian Discrete Diffusion with Causal Language Models: This paper proposes CaDDi, a framework that enables each denoising step to access the full generation trajectory via a non-Markovian discrete diffusion process, and unifies this process within a causal language model architecture, allowing pretrained LLMs to be directly reused as discrete diffusion models.
NPN: Non-Linear Projections of the Null-Space for Imaging Inverse Problems: This paper proposes Non-linear Projections of the Null-space (NPN)—a novel regularization strategy that trains a neural network to predict, directly from measurements, the projection coefficients of the ground-truth signal onto a low-dimensional subspace of the sensing matrix's null space. These coefficients serve as prior constraints on "invisible features" and can be flexibly integrated into diverse reconstruction frameworks including PnP, unrolled networks, DIP, and diffusion models. Convergence acceleration within the PnP framework is established theoretically.
ObCLIP: Oblivious Cloud-Device Hybrid Image Generation with Privacy Preservation: ObCLIP is proposed as an oblivious cloud-device hybrid image generation scheme. It expands a user prompt into a set of candidate prompts that differ only in sensitive attributes (e.g., gender, race), performs early denoising steps on all candidates in the cloud without revealing the true prompt, and allows the client to select the correct intermediate latent and complete the remaining denoising locally. Temporal and batch redundancy acceleration techniques reduce the additional overhead to below 4.4–7.6×.
OmniCast: A Masked Latent Diffusion Model for Weather Forecasting Across Time Scales: OmniCast is proposed as a weather forecasting method that combines a masked generative framework with a latent diffusion model. By jointly generating future weather sequences rather than iterating autoregressively, it mitigates error accumulation, achieves state-of-the-art performance at the subseasonal-to-seasonal (S2S) scale, remains competitive for medium-range forecasting, and offers inference speeds 10–20× faster.
OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers: OmniSync proposes a universal lip synchronization framework based on Diffusion Transformers, introducing three key innovations—a mask-free training paradigm, Flow Matching-based progressive noise initialization, and dynamic spatiotemporal CFG—to substantially outperform prior methods on both real and AI-generated videos, achieving an 87.78% success rate on stylized character lip sync (vs. 67.78% for the previous best).
OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions: OmniVCus proposes a feedforward DiT framework that achieves multi-subject, multimodal-controlled video customization through a data construction pipeline called VideoCus-Factory and two novel embedding mechanisms (Lottery Embedding and Temporally Aligned Embedding), significantly surpassing prior SOTA in identity preservation and controllability.
On Optimal Steering to Achieve Exact Fairness: This paper defines the concept of an ideal distribution—a data distribution under which the Bayes-optimal classifier for any cost-sensitive risk satisfies exact fairness—and proposes an optimization framework that identifies the nearest ideal distribution via KL divergence minimization, providing provable fairness guarantees for both fair preprocessing and LLM representation steering.
On the Emergence of Linear Analogies in Word Embeddings: This paper proposes a generative model based on binary semantic attributes and uses it to show analytically how linear analogy structures emerge in word embeddings (e.g., $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$), providing a unified explanation for four key empirical observations.
On the Relation between Rectified Flows and Optimal Transport: This paper presents a rigorous theoretical investigation of the relationship between rectified flows (flow matching) and optimal transport (OT). Through the construction of multiple counterexamples, it demonstrates that previously published claims asserting the asymptotic equivalence between gradient-constrained rectified flows and OT do not hold in general, and that stronger assumptions than those previously identified are required to guarantee such equivalence.
One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting: NTN-Diff is a frequency-aware diffusion model that decomposes the global semantic consistency problem into per-band consistency tasks over mid-frequency and low-frequency components. By adopting a "null-text–text–null-text" three-stage denoising strategy, the method simultaneously addresses two longstanding challenges in text-guided image inpainting: preserving unmasked regions and maintaining semantic consistency between masked and unmasked areas.
Orient Anything V2: Unifying Orientation and Rotation Understanding: Orient Anything V2 unifies 3D object orientation and rotation understanding via a scalable synthetic data engine, a symmetry-aware periodic distribution objective, and a multi-frame architecture, achieving zero-shot state-of-the-art performance across three tasks: orientation estimation, 6DoF pose estimation, and symmetry recognition.
OSMGen: Highly Controllable Satellite Image Synthesis using OpenStreetMap Data: OSMGen synthesizes high-fidelity satellite images directly from OSM JSON data (vector geometry, semantic tags, location, and temporal information), and generates temporally consistent before-after image pairs via DDIM inversion, enabling urban change simulation and data augmentation.
OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models: This paper presents OVERT, the first large-scale benchmark for evaluating over-refusal in text-to-image (T2I) models, comprising 4,600 benign prompts and 1,785 harmful prompts across 9 safety categories. It systematically evaluates over-refusal behavior in 5 mainstream T2I models, revealing a strong trade-off between safety and utility.
Pairwise Optimal Transports for Training All-to-All Flow-Based Condition Transfer Model: This paper proposes A2A-FM, a method that simultaneously learns optimal transport mappings across all pairs of conditional distributions within the Flow Matching framework via a novel cost function. It is theoretically shown to converge to pairwise optimal transport in the infinite-sample limit, and is particularly suited for non-grouped data with continuous conditional variables.
Perturb a Model, Not an Image: Towards Robust Privacy Protection via Anti-Personalized Diffusion Models: This paper proposes the Anti-Personalized Diffusion Model (APDM), which for the first time shifts privacy protection from the data level (image perturbation) to the model level (parameter update). Through a Direct Protective Optimization (DPO) loss and a Learning to Protect (L2P) dual-path optimization strategy, APDM robustly prevents diffusion models from personalizing to specific subjects while preserving the model's generation and personalization capabilities for other subjects.
Physics-Constrained Flow Matching: Sampling Generative Models with Hard Constraints: This paper proposes Physics-Constrained Flow Matching (PCFM), a zero-shot inference framework that enforces arbitrary nonlinear equality constraints to machine precision during sampling from pretrained flow matching models. The framework alternates among forward shooting with projection, OT-interpolation backward updates, and relaxed penalty correction at each sub-step, achieving up to 99.5% improvement over baselines on PDE problems involving shocks and discontinuities.
Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection: A physics conservation law-based paradigm for AI-generated video detection is proposed. A normalized spatiotemporal gradient (NSG) statistic is defined to capture the ratio of spatial probability gradients to temporal density changes. Pre-trained diffusion models are used to estimate NSG, and detection is performed via MMD. The method surpasses the state of the art by 16% in Recall and 10.75% in F1.
PID-controlled Langevin Dynamics for Faster Sampling of Generative Models: This work introduces PID control theory into Langevin dynamics sampling, leveraging gradient history (integral term) to build momentum for traversing energy barriers and gradient trends (derivative term) to suppress oscillations, achieving fast and stable convergence. The approach requires no additional training and delivers over 10× sampling acceleration on both SGMs and EBMs.
PixPerfect: Seamless Latent Diffusion Local Editing with Discriminative Pixel-Space Refinement: This paper proposes PixPerfect, a general-purpose pixel-level refinement framework that eliminates color discrepancies, texture mismatches, and visible seams in local editing with latent diffusion models (LDMs) through a discriminative pixel-space loss and a comprehensive artifact simulation pipeline, achieving substantial improvements in visual fidelity across inpainting, object removal, and insertion tasks.
Pragmatic Heterogeneous Collaborative Perception via Generative Communication Mechanism: This paper proposes GenComm — a generative communication mechanism for heterogeneous multi-agent collaborative perception. By extracting spatial messages and employing a conditional diffusion model, the ego agent locally generates aligned collaborator features without modifying any existing network, enabling new heterogeneous agents to be onboarded at minimal cost.
Preconditioned Langevin Dynamics with Score-Based Generative Models for Infinite-Dimensional Linear Bayesian Inverse Problems: This paper rigorously analyzes score-based generative model (SGM)-driven Langevin posterior samplers in infinite-dimensional Hilbert spaces, derives for the first time convergence bounds that explicitly depend on score approximation errors, and identifies an optimal preconditioner that jointly depends on the forward operator and score errors, guaranteeing uniform convergence rates across all posterior modes.
Predictive Feature Caching for Training-free Acceleration of Molecular Geometry Generation: This paper transfers predictive feature caching strategies from the image generation domain to molecular geometry generation, exploiting the temporal smoothness of hidden states along sampling trajectories to achieve training-free 2–3× inference acceleration, with up to 7× speedup when combined with other optimization techniques.
Preventing Shortcuts in Adapter Training via Providing the Shortcuts: This paper proposes Shortcut-Rerouted Adapter Training, which actively provides dedicated pathways for confounding factors during adapter training (e.g., a LoRA absorbing distribution shifts, a ControlNet absorbing pose/expression), thereby constraining the adapter to learn only the target attribute (e.g., identity). The auxiliary modules are discarded at inference time, yielding a disentangled adapter.
Progressive Inference-Time Annealing of Diffusion Models for Sampling from Boltzmann Densities: This paper proposes PITA (Progressive Inference-Time Annealing), a framework that combines temperature annealing and diffusion smoothing as two complementary interpolation strategies. PITA trains an initial diffusion model at high temperature, then applies a novel Feynman-Kac PDE with SMC resampling to progressively anneal toward lower temperatures at inference time, training a sequence of diffusion models until the target temperature is reached. This approach achieves equilibrium sampling of alanine dipeptide and tripeptide in Cartesian coordinates for the first time.
Prompt-Based Safety Guidance Is Ineffective for Unlearned Text-to-Image Diffusion Models: This paper identifies that combining training-based concept unlearning with training-free safety guidance (negative prompt guidance) yields degraded performance, and proposes replacing explicit negative prompts with implicit concept embeddings obtained via Concept Inversion, effectively restoring the defensive capability of training-free methods on unlearned models.
Ψ-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models: This paper proposes the Ψ-Sampler framework, which introduces initial particle sampling based on the preconditioned Crank-Nicolson Langevin (pCNL) algorithm into SMC-based inference-time reward alignment. By initializing particles from a reward-aware posterior distribution, the framework substantially improves alignment performance on layout-guided generation, quantity-aware generation, and aesthetic preference generation.
Rare Text Semantics Were Always There in Your Diffusion Transformer: This paper discovers that scaling up the variance of text token embeddings before the joint attention blocks in MM-DiT enables diffusion models to render rare text semantics, without any additional training or external modules.
Real-Time Execution of Action Chunking Flow Policies: This paper proposes Real-Time Chunking (RTC), which frames asynchronous action chunk execution as an inpainting problem. By freezing already-executed actions and "inpainting" the remainder to be consistent with the prefix, RTC enables smooth real-time execution of diffusion/flow policies without any retraining.
Rectified-CFG++ for Flow Based Models: To address the off-manifold drift caused by standard CFG in Rectified Flow models, this paper proposes Rectified-CFG++—an adaptive predictor-corrector guidance strategy that replaces extrapolative guidance with conditional flow prediction combined with time-scheduled interpolative correction. The method comprehensively outperforms standard CFG on large-scale models including Flux, SD3, SD3.5, and Lumina.
Recurrent Memory for Online Interdomain Gaussian Processes: This paper proposes OHSVGP (Online HiPPO Sparse Variational Gaussian Process), which introduces the HiPPO (High-order Polynomial Projection Operator) framework from deep learning into sparse variational Gaussian processes as interdomain inducing variables. By leveraging time-varying orthogonal polynomial basis functions, the method achieves long-term memory retention in online learning, with kernel matrices updated efficiently via ODE recursion.
Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models: This paper proposes the Diffusion Chain of Lateral Thought (DCoLT), which treats each intermediate step in the reverse process of a diffusion language model as a latent "thinking" action and optimizes the entire reasoning trajectory via outcome-based reinforcement learning. DCoLT achieves state-of-the-art performance on mathematics and code generation benchmarks with both SEDD and LLaDA diffusion language models.
Remasking Discrete Diffusion Models with Inference-Time Scaling: This paper proposes the ReMDM sampler, which enables iterative error correction in discrete mask diffusion models by allowing already-decoded tokens to be remasked during generation. This mechanism supports inference-time compute scaling and yields substantial quality improvements on text, image, and molecular design tasks.
RepLDM: Reprogramming Pretrained Latent Diffusion Models for High-Quality, High-Efficiency, High-Resolution Image Generation: This paper proposes RepLDM, a reprogramming framework that enables pretrained latent diffusion models to generate high-quality, high-resolution images without retraining, via two stages: an attention guidance stage and a progressive upsampling stage, while substantially improving efficiency.
RespoDiff: Dual-Module Bottleneck Transformation for Responsible & Faithful T2I Generation: This paper proposes RespoDiff, a framework that introduces two learnable transformation modules at the bottleneck layer of a diffusion model UNet — a Responsibility Alignment Module (RAM) and a Semantic Alignment Module (SAM) — trained via score matching objectives to achieve fair and safe text-to-image generation while preserving image quality and semantic fidelity.
Riemannian Consistency Model: This work is the first to extend Consistency Models (CM) to Riemannian manifolds. By leveraging exponential map parameterization and covariant derivatives, it derives both discrete- and continuous-time RCM objectives, enabling high-quality few-step generation on non-Euclidean geometries such as spheres, flat tori, and SO(3).
Riemannian Flow Matching for Brain Connectivity Matrices via Pullback Geometry: This paper proposes DiffeoCFM, which leverages pullback metrics induced by global diffeomorphisms to equivalently reformulate conditional flow matching on Riemannian manifolds as standard CFM in Euclidean space. The method enables efficient generation of brain connectivity matrices (SPD/correlation) while strictly preserving manifold constraints, achieving state-of-the-art performance on 3 fMRI and 2 EEG datasets.
RLVR-World: Training World Models with Reinforcement Learning: This paper proposes the RLVR-World framework, extending the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to world model training. By directly optimizing target metrics (e.g., prediction accuracy, perceptual quality) as verifiable rewards, the framework achieves significant improvements on both language and video world models.
RLZero: Direct Policy Inference from Language Without In-Domain Supervision: This paper proposes RLZero, a framework that converts natural language instructions into behavioral policies in target environments via an "Imagine → Project → Imitate" pipeline. A video generation model is used to "imagine" observation sequences from language; these are then projected into the target domain; finally, an unsupervised pretrained RL agent imitates the projected sequences via a closed-form solution — all without any in-domain supervision or annotated trajectories.
Robustness in Both Domains: CLIP Needs a Robust Text Encoder: This paper proposes LEAF (Levenshtein Efficient Adversarial Finetuning), the first adversarial fine-tuning method targeting the CLIP text encoder. LEAF substantially improves robustness under character-level text perturbations across zero-shot classification, text-image retrieval, and image generation, while preserving performance in the image domain.
Safe and Stable Control via Lyapunov-Guided Diffusion Models: This paper proposes S²Diff, a model-based diffusion planning framework that leverages Control Lyapunov Barrier Functions (CLBF) to guide diffusion sampling for generating trajectory-level control policies. Without requiring control-affine assumptions or quadratic programming, S²Diff simultaneously guarantees safety and stability on a variety of nonlinear dynamical systems, achieving an average safety rate of 98.75%.
SAO-Instruct: Free-form Audio Editing using Natural Language Instructions: This paper proposes SAO-Instruct, the first audio editing model supporting fully free-form natural language instructions. Training data consisting of editing triplets is constructed via three pipelines — Prompt-to-Prompt, DDPM inversion, and manual editing — and Stable Audio Open is fine-tuned to achieve context-preserving, targeted audio modification.
Scalable, Explainable and Provably Robust Anomaly Detection with One-Step Flow Matching: This paper proposes TCCM (Time-Conditioned Contraction Matching), a flow matching-inspired semi-supervised anomaly detection method for tabular data. By learning a time-conditioned velocity field that contracts normal data toward the origin, TCCM computes anomaly scores in a single forward pass, achieving top AUROC and AUPRC rankings across 47 ADBench datasets while running 1573× faster than DTE.
ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion: ScaleDiff is a framework that eliminates redundant overlap computation in conventional patch-based methods via Neighborhood Patch Attention (NPA). Combined with Latent Frequency Mixing (LFM) and Structure Guidance (SG), it extends pretrained diffusion models to high resolutions (e.g., 4096²) without any additional training, achieving state-of-the-art quality among training-free methods and significant inference acceleration (8.9× faster than DemoFusion) on both U-Net and DiT architectures.
Scaling Can Lead to Compositional Generalization: Through theoretical proofs and large-scale experiments, this paper demonstrates that standard MLPs can achieve compositional generalization solely by scaling data and model size, without explicit modular architectural design. Moreover, when compositional generalization succeeds, task components can be linearly decoded from hidden-layer activations — a metric that correlates positively with compositional success rates in diffusion-based image generation.
Scaling Diffusion Transformers Efficiently via μP: This paper extends Maximal Update Parametrization (μP) from standard Transformers to diffusion Transformers (DiT, PixArt-α, MMDiT, etc.), demonstrating that optimal hyperparameters found on small proxy models transfer stably to large models, significantly reducing the hyperparameter tuning cost for large-scale diffusion models.
Scaling Offline RL via Efficient and Expressive Shortcut Models: This paper proposes SORL, which leverages the self-consistency property of shortcut models to enable efficient single-stage training with variable inference steps for policy optimization in offline RL, while supporting both sequential and parallel test-time scaling.
SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency: SceneDecorator presents a training-free framework that, for the first time, systematically addresses scene planning and scene consistency in story generation via VLM-guided global-to-local scene planning and a long-term scene-sharing attention mechanism, achieving significant improvements over existing methods on scene alignment and consistency metrics.
SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation: SceneDesigner introduces a CNOCS map representation combined with a two-stage reinforcement learning training strategy, achieving for the first time precise 9D pose control (position, size, and orientation) over multiple objects, significantly outperforming existing methods in both controllability and generation quality.
Schrödinger Bridge Matching for Tree-Structured Costs and Entropic Wasserstein Barycentres: This paper extends the Iterative Markovian Fitting (IMF) procedure to the tree-structured Schrödinger Bridge problem, proposing the TreeDSBM algorithm. For Wasserstein barycentre computation, it elegantly merges IMF iterations with fixed-point iterations, requiring only inexpensive bridge-matching steps for efficient solution.
Score-informed Neural Operator for Enhancing Ordering-based Causal Discovery: This paper proposes SciNO (Score-informed Neural Operator), a probabilistic generative model designed in a smooth function space that stably approximates the log-density Hessian diagonal to improve ordering-based causal discovery, achieving a 42.7% reduction in order divergence on synthetic graphs and 31.5% on real-world data.
Semantic Surgery: Zero-Shot Concept Erasure in Diffusion Models: This paper proposes Semantic Surgery, a training-free zero-shot inference-time concept erasure framework that calibrates text embeddings via vector subtraction prior to the diffusion process, incorporates Co-Occurrence Encoding for multi-concept erasure, and employs a visual feedback loop to address latent concept persistence (LCP). The method comprehensively outperforms state-of-the-art approaches across object, NSFW, style, and celebrity erasure tasks.
Shallow Diffuse: Robust and Invisible Watermarking through Low-Dimensional Subspaces in Diffusion Models: This paper proposes Shallow Diffuse, a method that exploits the local linearity and low-rank Jacobian of the posterior mean predictor (PMP) in diffusion models to embed watermarks at intermediate diffusion timesteps. This design decouples the watermark from the generation process, achieving, for the first time, both high fidelity and high robustness simultaneously under both server-side and user-side deployment scenarios.
Shortcutting Pre-trained Flow Matching Diffusion Models is Almost Free Lunch: This paper proposes SCFM (ShortCutting Flow Matching), a highly efficient post-training distillation method that compresses pre-trained flow matching models (e.g., Flux with 12B parameters) into 3-step samplers via velocity field self-distillation, requiring less than one A100-day and no step-size embeddings or adversarial distillation.
Show-o2: Improved Native Unified Multimodal Models: This paper presents Show-o2, a natively unified multimodal model built upon autoregressive modeling and Flow Matching. By constructing unified visual representations in a 3D causal VAE space via dual-path spatial(-temporal) fusion, Show-o2 supports multimodal understanding and generation across text, images, and video, with a two-stage training strategy that effectively preserves language knowledge.
SparseDiT: Token Sparsification for Efficient Diffusion Transformer: This paper proposes SparseDiT, which achieves 55% FLOPs reduction and 175% inference throughput improvement on DiT-XL 512×512 with only 0.09 FID degradation. The method employs a three-stage spatial architecture (bottom Poolingformer + middle Sparse-Dense Token Module + top full-density processing) combined with a dynamic pruning-rate schedule along the temporal dimension, and successfully extends to video generation and text-to-image generation tasks.
Split Gibbs Discrete Diffusion Posterior Sampling: This paper proposes SGDD (Split Gibbs Discrete Diffusion), a plug-and-play posterior sampling algorithm for discrete diffusion models based on the split Gibbs sampling principle. By introducing auxiliary variables and a Hamming-distance-based regularization potential, SGDD decomposes posterior sampling into alternating likelihood and prior sampling steps, achieving substantial improvements over baselines on DNA sequence design, discrete image inverse problems, and music infilling tasks.
SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing: SplitFlow decomposes a target prompt into multiple sub-prompts, computes an independent editing flow for each, and combines them into a unified editing trajectory via latent trajectory projection and adaptive velocity field aggregation. This resolves gradient entanglement and achieves higher fidelity and editability in text-guided image editing without requiring inversion.
StableGuard: Towards Unified Copyright Protection and Tamper Localization in Latent Diffusion Models: StableGuard embeds global binary watermarks into the LDM generation pipeline (via MPW-VAE) and leverages changes in watermark perturbation patterns for tamper localization (via MoE-GFN), achieving the first end-to-end unified framework for copyright protection and tamper detection.
State-Covering Trajectory Stitching for Diffusion Planners: This paper proposes SCoTS (State-Covering Trajectory Stitching), a reward-free trajectory augmentation framework that iteratively stitches short trajectory segments in a temporal distance-preserving latent space to systematically expand state-space coverage, significantly improving the generalization of diffusion planners on long-horizon and out-of-distribution tasks.
StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold: This paper proposes StelLA, which decomposes the LoRA adaptation matrix into a three-factor form $USV^\top$ and constrains $U$ and $V$ to the Stiefel manifold for Riemannian optimization, enabling explicit subspace learning during training. StelLA consistently outperforms existing LoRA variants across multiple downstream tasks.
System-Embedded Diffusion Bridge Models: This paper proposes System-embedded Diffusion Bridge Models (SDB), which directly embed a known linear measurement system into the coefficients of a matrix-valued SDE, enabling decoupled control over denoising in the range space and information synthesis in the null space. SDB achieves consistent performance improvements across multiple inverse problems and demonstrates strong robustness to system mismatch.
T2SMark: Balancing Robustness and Diversity in Noise-as-Watermark for Diffusion Models: This paper proposes T2SMark, a two-stage watermarking scheme for diffusion models based on Tail-Truncated Sampling (TTS). By embedding watermark bits in the tail regions of the Gaussian distribution while sampling randomly from the central region, T2SMark is the first method to achieve an optimal balance between watermark robustness and generation diversity.
Text-to-Image Models Leave Identifiable Signatures: Implications for Leaderboard Security: This paper demonstrates that text-to-image (T2I) models leave identifiable "signatures" in their generated images due to differences in training data, architecture, and scale. Even without controlling the input prompt, an adversary can de-anonymize models on leaderboards via simple centroid classification in CLIP embedding space, achieving 87% Top-1 accuracy, thereby enabling ranking manipulation attacks.
Text to Sketch Generation with Multi-Styles: This paper proposes M3S (Multi-Style Sketch Synthesis), a training-free framework that achieves single- and multi-style sketch generation conditioned on text prompts and reference style sketches, via linearly smoothed K/V feature injection, joint AdaIN style tendency control, and style-content disentangled guidance.
ThermalGen: Style-Disentangled Flow-Based Generative Models for RGB-to-Thermal Image Translation: This paper proposes ThermalGen, an adaptive flow-based generative model that achieves, for the first time, high-fidelity RGB-to-Thermal image translation across diverse viewpoints, sensors, and environmental conditions via an RGB-conditioned architecture and a style disentanglement mechanism. Three new large-scale satellite–aerial RGB-T paired datasets are also released.
TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising: This work introduces TIDMAD — the first ultra-long time series denoising benchmark dataset for dark matter searches — comprising training/validation/science data from the ABRACADABRA experiment, a denoising score metric, and a complete analysis pipeline, enabling AI algorithms to directly produce physics-community-standard dark matter search results.
Token Perturbation Guidance for Diffusion Models: This paper proposes Token Perturbation Guidance (TPG), which constructs a negative score signal by applying norm-preserving shuffling perturbations to intermediate token representations in diffusion models, enabling training-free, condition-agnostic guidance. TPG improves the FID of SDXL by nearly 2× in unconditional generation and approaches CFG-level performance in conditional generation.
Tortoise and Hare Guidance: Accelerating Diffusion Model Inference with Multirate Integration: This paper proposes Tortoise and Hare Guidance (THG), a training-free acceleration strategy for diffusion sampling that reformulates the classifier-free guidance (CFG) ODE as a multirate ODE system. The noise estimation term is integrated with fine-grained steps (tortoise equation), while the additional guidance term is integrated with coarse-grained steps (hare equation), reducing the number of function evaluations (NFE) by up to 30% with negligible degradation in generation quality.
Toward a Unified Geometry Understanding: Riemannian Diffusion Framework for Graph Generation and Prediction: This paper proposes GeoMancer, a framework that replaces numerically unstable exponential maps with a Riemannian GyroKernel autoencoder to disentangle multi-level graph features onto task-specific product manifolds. By further introducing manifold-constrained diffusion and a self-guidance generation strategy, GeoMancer achieves unified modeling and state-of-the-art performance across molecular generation, node classification, and graph regression tasks.
Towards a Golden Classifier-Free Guidance Path via Foresight Fixed Point Iterations: This paper unifies conditional guidance under a fixed-point iteration framework, showing that CFG and its variants are all special cases of single-step iterations over short intervals. It theoretically proves their suboptimality and proposes Foresight Guidance (FSG)—performing multi-step iterations over longer intervals in early diffusion stages to achieve better alignment quality with less computation.
Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge: This paper proposes LDDBM (Latent Denoising Diffusion Bridge Model), which extends denoising diffusion bridge models into a shared latent space and incorporates contrastive alignment loss and predictive loss to achieve a general-purpose framework for arbitrary modality translation.
Towards Resilient Safety-Driven Unlearning for Diffusion Models Against Downstream Fine-tuning: This paper proposes ResAlign, a framework that leverages Moreau envelope approximation and meta-learning to make safety-driven unlearning in diffusion models resilient against harmful capability recovery induced by downstream fine-tuning, even when fine-tuning is performed exclusively on benign data.
Towards Robust Zero-Shot Reinforcement Learning: This paper proposes BREEZE, a framework that systematically addresses out-of-distribution (OOD) extrapolation errors and insufficient expressivity in forward-backward (FB) representation-based zero-shot RL through behavior-regularized representation guidance, task-conditioned diffusion policy extraction, and attention-enhanced representation modeling. BREEZE achieves state-of-the-art or near-state-of-the-art robust zero-shot generalization on ExORL and D4RL Kitchen benchmarks.
Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling: This paper proposes TIRE (Track, Inpaint, REsplat), a three-stage pipeline that locates unobserved regions via video tracking, progressively infills textures using a subject-driven inpainting model, and back-projects multi-view consistent results into 3D, enabling identity-preserving 3D/4D generation.
Training-Free Constrained Generation with Stable Diffusion Models: This paper proposes a training-free constrained generation method that embeds Proximal Langevin Dynamics into the reverse denoising process of Stable Diffusion. Image-space constraints are backpropagated to the latent space via the decoder, enabling strict constraint satisfaction on generated outputs without retraining.
Training-Free Safe Text Embedding Guidance for Text-to-Image Diffusion Models: This paper proposes Safe Text embedding Guidance (STG), a training-free approach for safe text-to-image generation that dynamically adjusts text embeddings during diffusion sampling based on a safety function evaluated on the expected denoised image. STG effectively removes unsafe content while maximally preserving the original semantic intent.
Transferable Black-Box One-Shot Forging of Watermarks via Image Preference Models: This paper proposes a black-box watermark forging method based on image preference models. Given only a single watermarked image, the method extracts the watermark via backpropagation and transfers it to arbitrary new images, effectively forging multiple post-hoc watermarking schemes without access to the underlying watermarking algorithm.
Tree-Guided Diffusion Planner: This paper proposes the Tree-guided Diffusion Planner (TDP), which formalizes test-time diffusion planning as a tree search problem. Through bi-level sampling—particle-guided generation of diverse parent trajectories for exploration, combined with fast conditional denoising to generate child trajectories for exploitation—TDP achieves a strong exploration–exploitation balance and substantially outperforms existing methods under non-convex objectives and non-differentiable constraints.
Two-Steps Diffusion Policy for Robotic Manipulation via Genetic Denoising: By revealing the distributional mismatch caused by clipping operations in diffusion policies, this paper proposes GDP—a method combining denoising schedule optimization and genetic algorithm-based population selection—that enables off-the-shelf DDPM diffusion policies to match or surpass 100-step baselines with only 2-step inference, without any retraining.
UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset: This work constructs UltraHR-100K, a large-scale dataset comprising 100K ultra-high-resolution images with rich annotations, and proposes a Frequency-Aware Post-Training (FAPT) method combining Detail-Oriented Timestep Sampling (DOTS) and Soft-Weighted Frequency Regularization (SWFR) based on DFT, enabling pretrained T2I models to generate fine-grained details at ultra-high resolutions.
Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation: By systematically analyzing three key properties that hinder visual semantic learning in autoregressive image generation — local conditional dependency, inter-step semantic inconsistency, and the absence of spatial invariance — this paper proposes ST-AR, a training framework that incorporates masked image modeling and contrastive learning into the next-token prediction objective. Without relying on any pretrained representation model, ST-AR improves the FID of LlamaGen-XL by approximately 49% (from 19.42 to 9.81), achieving performance comparable to a 3B-parameter model trained for 300 epochs within only 50 epochs.
Understanding Representation Dynamics of Diffusion Models via Low-Dimensional Models: Under a Mixture of Low-Rank Gaussians (MoLRG) data model, this paper theoretically proves that the unimodal dynamics of representation quality across noise levels arise from a trade-off between denoising strength and class discriminability, and empirically demonstrates that the emergence of unimodal dynamics serves as a reliable indicator of model generalization.
UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback: This paper proposes UniLumos, a unified image and video relighting framework that enhances physical plausibility by incorporating RGB-space depth and normal geometry feedback into a flow matching backbone, while achieving 20× inference speedup through path consistency learning.
Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations: This paper identifies the massive activations phenomenon in Diffusion Transformers (DiTs), which renders features non-discriminative, reveals its intrinsic connection to AdaLN, and proposes DiTF, a training-free framework for extracting semantically discriminative features that surpasses DINO and SD models on visual correspondence tasks.
UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation: This paper proposes UtilGen, a utility-centric generative data augmentation framework that evaluates the downstream task utility of synthetic data via a meta-learning weight network, and employs a dual-level optimization strategy—model-level DPO and instance-level (prompt + noise) optimization—to adaptively generate high-utility synthetic training data, achieving an average improvement of 3.87% across 8 benchmarks.
V-CECE: Visual Counterfactual Explanations via Conceptual Edits: V-CECE proposes the first black-box visual counterfactual explanation framework that systematically reveals the explanatory gap between human and neural network semantic understanding. It guarantees edit-set optimality via WordNet knowledge graphs and the Hungarian algorithm, and executes concept-level edits using Stable Diffusion. The key finding is that CNN classifiers are severely misaligned with human semantic reasoning (requiring 5+ edit steps), whereas LVLMs (Claude 3.5 Sonnet) are highly aligned with humans (requiring only 2–3 steps).
Value Gradient Guidance for Flow Matching Alignment: This paper proposes VGG-Flow, which leverages the Hamilton-Jacobi-Bellman (HJB) equation from optimal control theory to reformulate flow matching alignment as a gradient matching task—matching the residual velocity field to the gradient of the value function—enabling efficient reward alignment while preserving the prior distribution.
Vicinity-Guided Discriminative Latent Diffusion for Privacy-Preserving Domain Adaptation: This paper proposes Discriminative Vicinity Diffusion (DVD), which for the first time employs latent diffusion models for discriminative knowledge transfer. By training a diffusion model within the vicinity latent space of source-domain features to generate source-style cues, DVD enables domain adaptation without access to source data, surpassing state-of-the-art methods on standard SFDA benchmarks.
Watermarking Autoregressive Image Generation: This paper is the first to adapt LLM watermarking (KGW green/red scheme) to the token level of autoregressive image generation models. It identifies and addresses the key challenge of insufficient Reverse Cycle Consistency (RCC) through tokenizer–detokenizer fine-tuning and a watermark synchronization layer, achieving robust image watermark detection with theoretical guarantees.
What We Don't C: Manifold Disentanglement for Structured Discovery: This paper proposes WWDC (What We Don't C), a method that employs conditionally guided latent flow matching to remove known information from existing VAE representations, enabling unknown features to be more readily discovered and accessed in the residual manifold, thus facilitating iterative scientific discovery.
When Are Concepts Erased From Diffusion Models?: This paper proposes two mechanistic models of concept erasure (guidance-based avoidance vs. destruction-based removal) and designs a suite of five independent probing methods—spanning optimization search, in-context probing, noise trajectory probing, classifier-guided probing, and dynamic concept tracing—to systematically demonstrate that most existing erasure methods merely "circumvent" concepts rather than genuinely eliminating the underlying knowledge.
Where and How to Perturb: On the Design of Perturbation Guidance in Diffusion and Flow Models: This paper proposes the HeadHunter framework and SoftPAG method, refining the granularity of attention perturbation in diffusion models from the layer level down to individual attention heads. It is the first work to reveal that different attention heads govern distinct visual concepts (structure, style, texture, etc.), enabling more precise and composable generation guidance.
Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training: Through numerical experiments and theoretical analysis, this paper identifies two critical timescales in diffusion model training — a generalization time $\tau_{\text{gen}}$ and a memorization time $\tau_{\text{mem}}$ — where the latter scales linearly with training set size $n$ while the former remains constant. The resulting implicit dynamical regularization enables early stopping to prevent memorization even in heavily overparameterized regimes.
Why Diffusion Models Don't Memorize: The Role of Implicit Regularization: This paper reveals, through both numerical experiments and theoretical analysis, an implicit dynamical regularization mechanism in diffusion model training: the gap between the timescale for generating high-quality samples $\tau_\text{gen}$ and the timescale for memorization $\tau_\text{mem}$ grows linearly with training set size $n$, providing theoretical justification for early stopping.
Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation: Through theoretical analysis on Gaussian mixture models and large-scale experiments on the SmolLM2 family via multi-level distillation, this paper reveals the core mechanism of knowledge distillation in generative models: distillation induces a tradeoff in the student model between precision (generation quality) and recall (distribution coverage), governed by the entropy of the teacher distribution.
WMCopier: Forging Invisible Image Watermarks on Arbitrary Images: This paper proposes WMCopier, the first diffusion-model-based no-box watermark forging attack that requires no prior knowledge of the target watermarking algorithm. By training an unconditional diffusion model to learn the watermark distribution, injecting watermark signals via shallow DDIM inversion, and refining results through iterative optimization, WMCopier achieves high forging success rates against both open-source and commercial watermarking systems, including Amazon's.
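
The "simple centroid classification" in embedding space mentioned in the Text-to-Image Models Leave Identifiable Signatures entry above can be illustrated with a minimal nearest-centroid sketch. This is not the paper's code: the synthetic vectors below only stand in for CLIP image embeddings, and the function names (`fit_centroids`, `predict`, `sample`), dimensions, and noise level are hypothetical choices made for illustration.

```python
import numpy as np

def fit_centroids(embeddings, labels):
    """Return (class_ids, centroids): one L2-normalized mean embedding per class."""
    class_ids = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in class_ids])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    return class_ids, centroids

def predict(embeddings, class_ids, centroids):
    """Assign each embedding to the class whose centroid has the highest cosine similarity."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return class_ids[np.argmax(unit @ centroids.T, axis=1)]

def sample(means, per_class, noise, rng):
    """Draw noisy points around each class mean (assumed stand-in for real image embeddings)."""
    dim = means.shape[1]
    x = np.concatenate([m + noise * rng.normal(size=(per_class, dim)) for m in means])
    y = np.repeat(np.arange(len(means)), per_class)
    return x, y

rng = np.random.default_rng(0)
means = rng.normal(size=(4, 32))           # 4 hypothetical generators, 32-dim embeddings
train_x, train_y = sample(means, 50, 0.3, rng)
test_x, test_y = sample(means, 20, 0.3, rng)

class_ids, centroids = fit_centroids(train_x, train_y)
accuracy = float(np.mean(predict(test_x, class_ids, centroids) == test_y))
print(f"top-1 accuracy on synthetic data: {accuracy:.2f}")
```

On real data, `train_x` would hold embeddings of images with known source models, and `predict` would be applied to embeddings of anonymized leaderboard images; the accuracy printed here reflects only the synthetic setup.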

🎮 Reinforcement Learning¶

A Differential and Pointwise Control Approach to Reinforcement Learning: This paper reformulates the RL problem via the differential dual form of continuous-time control, embeds physical priors through Hamiltonian structure, and proposes the dfPO algorithm for pointwise policy optimization. On scientific computing tasks (surface modeling, grid-based control, molecular dynamics), dfPO surpasses 12 RL baselines with fewer samples.
A Generalized Bisimulation Metric of State Similarity between Markov Decision Processes: From Theoretical Propositions to Applications: This paper extends the classical bisimulation metric (BSM), which is limited to measuring state similarity within a single MDP, to cross-MDP settings by proposing a Generalized Bisimulation Metric (GBSM). The authors rigorously prove three fundamental metric properties — symmetry, cross-MDP triangle inequality, and an upper bound on same-state distances — and derive tighter error bounds and closed-form sample complexities than standard BSM in three applications: policy transfer, state aggregation, and sampling-based estimation.
A Near-optimal, Scalable and Parallelizable Framework for Stochastic Bandits Robust to Adversarial Corruptions and Beyond: This paper proposes BARBAT, an improvement over the classical BARBAR algorithm. By fixing epoch lengths and adjusting failure probabilities per epoch, BARBAT reduces the regret of stochastic multi-armed bandits under adversarial corruptions from $O(\sqrt{K}C)$ to the near-optimal $O(C)$ (eliminating the $\sqrt{K}$ factor), and successfully extends to multi-agent, graph bandit, combinatorial semi-bandit, and batched bandit settings.
A Theory of Multi-Agent Generative Flow Networks: This paper proposes a theoretical framework for Multi-Agent Generative Flow Networks (MA-GFlowNets) and establishes a "local-global principle" — the joint flow function can be decomposed into a product of individual agents' local flows. Four algorithms are designed (CFN/IFN/JFN/CJFN), among which JFN and CJFN realize Centralized Training with Decentralized Execution (CTDE). The proposed methods outperform RL and MCMC baselines on Hyper-Grid and StarCraft environments.
A Unifying View of Linear Function Approximation in Off-Policy RL Through Matrix Splitting and Preconditioning: This work is the first to introduce matrix splitting theory, unifying TD, FQI, and PFQI under linear function approximation as iterative methods for solving the same target linear system $(\Sigma_{cov} - \gamma\Sigma_{cr})\theta = \theta_{\phi,r}$, differing only in their preconditioners. It establishes necessary and sufficient conditions for the convergence of each algorithm, introduces the novel concept of rank invariance, and reveals that target networks are fundamentally a continuous transformation of the preconditioner from a constant to a data-adaptive form.
Act to See, See to Act: Diffusion-Driven Perception-Action Interplay for Adaptive Policies: This paper proposes DP-AG (Action-Guided Diffusion Policy), which uses the Vector-Jacobian Product (VJP) of a diffusion policy's noise prediction as a structured stochastic force to drive dynamic evolution of latent observation features across diffusion steps, and closes the perception-action loop via a cycle-consistent contrastive loss. DP-AG improves success rate by 6% on Push-T, by 13% on Dynamic Push-T, and by more than 23% on a real UR5 robot.
Actor-Free Continuous Control via Structurally Maximizable Q-Functions: This paper proposes Q3C (Q-learning for Continuous Control with Control-points), which approximates the Q-function via a learned set of control points such that the maximum value is structurally attained at one of those points. Combined with action-conditioned Q-value generation, a control-point diversity loss, and scale normalization, Q3C matches TD3 on standard benchmarks and substantially outperforms all actor-critic methods in constrained action spaces.
Adaptive Cooperative Transmission Design for URLLC via Deep RL: This paper proposes DRL-CoLA, a dual-agent DQN algorithm that adaptively configures 5G NR transmission parameters (numerology, mini-slot, MCS) at the source and relay nodes respectively. Operating over a two-hop relay system with only local CSI, DRL-CoLA achieves URLLC reliability close to the optimum attained under full global CSI.
Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning: This paper proposes ANQ (Adaptive Neighborhood-constrained Q learning), which introduces advantage-function-based adaptive neighborhood constraints for offline RL. ANQ offers a flexible middle ground between density constraints (overly conservative) and support constraints (requiring precise behavior policy modeling), and realizes efficient Q learning via a bilevel optimization framework, achieving state-of-the-art performance on the D4RL benchmark.
Adaptively Coordinating with Novel Partners via Learned Latent Strategies: This paper proposes the TALENTS framework, which learns a latent strategy space via a VAE, discovers strategy types through K-Means clustering, and performs online teammate-type inference using the Fixed-Share regret minimization algorithm, enabling zero-shot real-time adaptive coordination with unknown human or agent teammates.
ALINE: Joint Amortization for Bayesian Inference and Active Data Acquisition: ALINE proposes a unified framework for amortized Bayesian inference and active data acquisition. By combining a Transformer architecture with RL-based training, the model simultaneously learns to strategically select the most informative data points and perform instant posterior inference. It further supports flexible data acquisition targeting specific parameter subsets or predictive objectives.
Approximating Shapley Explanations in Reinforcement Learning: This paper proposes FastSVERL, a scalable parametric learning framework that separately approximates the two computational bottlenecks of Shapley values in reinforcement learning—the characteristic function and the Shapley summation—while supporting off-policy learning and continuous explanation updates as the policy evolves.
Automaton Constrained Q-Learning: This paper proposes ACQL (Automaton Constrained Q-Learning), which translates Linear Temporal Logic (LTL) task specifications into automata and combines goal-conditioned learning with minimal safety constraints. ACQL is the first scalable method to simultaneously support sequential temporal goals and non-stationary safety constraints in continuous control environments.
Bandit and Delayed Feedback in Online Structured Prediction: This paper is the first to study bandit and delayed feedback settings in online structured prediction. By designing a novel pseudo-inverse matrix gradient estimator, it achieves an $O(T^{2/3})$ surrogate regret bound that does not explicitly depend on the output set size $K$.
BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning: BEAST parameterizes action sequences via B-splines, estimating control points through ridge regression and uniformly quantizing them into fixed-length tokens. It achieves 20× token compression (100 steps → 5 tokens), mathematically guaranteed $C^0$ continuity across action chunks, the top success rate on LIBERO-Long (86.4%), and an inference throughput of 617 Hz (2.14× faster than π₀ and 101× faster than OpenVLA).
Behavior Injection: Preparing Language Models for Reinforcement Learning: This paper identifies the root cause of inconsistent LLM responses to RL fine-tuning. Through per-step influence analysis, it reveals that RL effectiveness depends on (1) the distribution of rollout accuracy (moderate is optimal) and (2) data co-influence magnitude. The proposed BRIDGE method injects exploration/exploitation behaviors during SFT, boosting subsequent RL gains from 6% to 46.6%.
Blending Complementary Memory Systems in Hybrid Quadratic-Linear Transformers: This paper proposes Hybrid Quadratic-Linear Transformers (HQLT), which integrate KV-memory (softmax attention: precise retrieval but quadratic complexity) and FW-memory (DeltaNet/linear attention: linear complexity but coarse retrieval) as complementary memory systems. Three hybrid strategies are compared (Delayed-Streaming, Delayed-Chunk, and Synchronous), and the Synchronous variant is shown to be optimal across language modeling, retrieval, algorithmic reasoning, and RL tasks at 340M and 1.3B parameter scales.
Bootstrap Off-policy with World Model (BOOM): This paper proposes the BOOM framework, which tightly couples an online planner (MPPI) with off-policy policy learning via a bootstrap loop: the policy initializes the planner, which in turn guides policy improvement through a likelihood-free alignment loss, supplemented by a soft Q-weighted mechanism to prioritize high-return behaviors, achieving state-of-the-art performance on high-dimensional continuous control tasks.
Bootstrap Off-policy with World Model: This paper proposes the BOOM framework, which distills high-quality actions from an online planner into a policy network via a bootstrap alignment loop. By employing a likelihood-free forward KL divergence and a soft Q-weighting mechanism, BOOM effectively mitigates the actor divergence between the planner and the policy, achieving state-of-the-art performance on high-dimensional continuous control tasks.
Boundary-to-Region Supervision for Offline Safe Reinforcement Learning: This paper proposes B2R (Boundary-to-Region), a framework that addresses the symmetric conditioning fallacy of sequence models in offline safe RL by introducing Cost-to-Go (CTG) Realignment. It converts sparse boundary supervision into dense safe-region supervision, satisfying safety constraints on 35 out of 38 safety-critical tasks.
Certifying Concavity and Monotonicity in Games via Sum-of-Squares Hierarchies: This paper proves that verifying concavity and monotonicity in games with polynomial utilities and semi-algebraic strategy sets is NP-hard, and proposes two hierarchical certification schemes based on sum-of-squares (SOS) programming, each solvable in polynomial time at every level of the hierarchy.
Certifying Stability of Reinforcement Learning Policies using Generalized Lyapunov Functions: This paper proposes a Generalized Lyapunov Function framework that combines RL value functions with neural network residual terms, replacing the classical strict per-step descent requirement with a multi-step weighted descent condition to certify the stability of RL policies.
Checklists Are Better Than Reward Models For Aligning Language Models: This paper proposes Reinforcement Learning from Checklist Feedback (RLCF), which decomposes instructions into dynamically generated yes/no checklists, scores each item using an AI judge and code verifier, and trains with DPO. RLCF consistently improves Qwen2.5-7B-Instruct across 5 benchmarks and is the only method that achieves positive gains on all benchmarks (FollowBench +4pt, InFoBench +6pt, Arena-Hard +3pt).
Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models: This paper proposes an "intention communication" architecture based on lightweight world models, enabling multi-agent coordination by generating and sharing future trajectory plans. The approach comprehensively outperforms end-to-end emergent communication methods in both scalability and performance.
Comparing Uniform Price and Discriminatory Multi-Unit Auctions through Regret Minimization: Under the online learning and regret minimization framework, this paper systematically compares the learning difficulty of uniform-price auctions and discriminatory auctions, proving that the two formats share identical worst-case regret rates, while under specific structural conditions the uniform-price auction admits faster learning rates.
Complexity Scaling Laws for Neural Models using Combinatorial Optimization: Using the Traveling Salesman Problem (TSP) as a case study, this paper investigates predictable scaling relationships between problem complexity (solution space size, representation space dimensionality) and model performance under fixed model capacity, revealing systematic performance trends for RL and SFT in combinatorial optimization.
Computational Hardness of Reinforcement Learning with Partial $q^\pi$-Realizability: This paper introduces the notion of "partial $q^\pi$-realizability" and proves that learning a near-optimal policy under this setting is NP-hard when using a greedy policy class, and requires exponential time under the randomized Exponential Time Hypothesis (rETH) when using a softmax policy class. These results bridge the theoretical gap between $q^*$-realizability and $q^\pi$-realizability.
Confounding Robust Deep Reinforcement Learning: A Causal Approach: This paper extends DQN via partial identification theory, proposing Causal DQN to learn robust policies from offline data with unobserved confounders—by optimizing a worst-case lower bound on the value function to obtain safe policies—and consistently outperforms standard DQN across 12 confounded Atari games.
Continual Knowledge Adaptation for Reinforcement Learning: This paper proposes CKA-RL, which maintains a task-specific knowledge vector for each task and employs softmax-weighted dynamic knowledge adaptation along with an adaptive knowledge merging mechanism, achieving a 4.20% overall performance gain and 8.02% forward transfer improvement across three continual RL benchmarks.
Convergence Theorems for Entropy-Regularized and Distributional Reinforcement Learning: This paper proposes the temperature decoupling gambit, proving that in entropy-regularized reinforcement learning, by decoupling the evaluation temperature from the behavioral temperature, both the policy and the return distribution converge—as the temperature tends to zero—to an interpretable, diversity-preserving optimal policy.
CORE: Constraint-Aware One-Step Reinforcement Learning for Simulation-Guided Neural Network Accelerator Design: This paper proposes CORE (Constraint-aware One-step REinforcement learning), a critic-free single-step RL framework that efficiently explores the joint hardware–mapping design space of DNN accelerators via structured distribution sampling, a scaling-graph decoder, and constraint-aware reward shaping, achieving at least 15× latency improvement across 7 DNN models.
Counteractive RL: Rethinking Core Principles for Efficient and Scalable Deep Reinforcement Learning: CoAct TD Learning challenges the random exploration paradigm of ε-greedy by selecting, with probability ε, the action that minimizes $Q(s,a)$ (rather than a random action) to obtain high temporal-difference signals. The paper proves theoretically that this produces larger TD errors; the method achieves a 248% performance improvement on Atari 100K and requires only a 2-line code change with zero additional computation.
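The action-selection rule described in the CoAct TD entry above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `coact_action` and `epsilon_greedy_action` and the list-of-floats representation of $Q(s,\cdot)$ are assumptions made for the example.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Standard epsilon-greedy: explore with a uniformly random action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def coact_action(q_values, epsilon, rng=random):
    """CoAct-style selection as summarized in the entry above (hypothetical helper).

    With probability epsilon, take the action with the LOWEST Q-value
    instead of a random one; otherwise act greedily.
    """
    if rng.random() < epsilon:
        return min(range(len(q_values)), key=q_values.__getitem__)
    return max(range(len(q_values)), key=q_values.__getitem__)


q = [0.2, -1.0, 3.5]
explore = coact_action(q, epsilon=1.0)  # always takes the exploration branch
exploit = coact_action(q, epsilon=0.0)  # always takes the greedy branch
```

The only difference from ε-greedy is the exploration branch, which is consistent with the small code change the entry mentions.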
DCcluster-Opt: Benchmarking Dynamic Multi-Objective Optimization for Geo-Distributed Data Center Workloads: This paper proposes DCcluster-Opt, an open-source high-fidelity simulation benchmark platform for geo-distributed data centers. It integrates real-world datasets (carbon intensity, electricity prices, weather, etc.) and physics-based models to support reinforcement learning research on dynamic multi-objective workload scheduling.
Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation: SambaY proposes the Gated Memory Unit (GMU) for sharing SSM token-mixing representations across layers, replacing half of the cross-attention layers in YOCO's cross-decoder with lightweight GMUs. This maintains linear prefill complexity and long-context retrieval capability while substantially improving decoding efficiency. The resulting model, Phi4-mini-Flash-Reasoning (3.8B), outperforms Phi4-mini-Reasoning on reasoning benchmarks and achieves up to 10× decoding throughput improvement in the 2K prompt + 32K generation setting.
Deep RL Needs Deep Behavior Analysis: Exploring Implicit Planning by Model-Free Agents: This paper proposes ForageWorld, a naturalistic foraging environment, and a neuroscience-inspired joint behavior-neural analysis framework, revealing that model-free RNN-based DRL agents exhibit structured, planning-like behavior through emergent dynamics—without explicit memory modules or world models.
DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning: DeepDiver is an RL-driven search-reasoning framework that trains LLMs for information-seeking in real open-web environments, giving rise to an emergent behavior termed Search Intensity Scaling (SIS)—enabling a 7B model to match DeepSeek-R1 (671B) on knowledge-intensive tasks.
DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning: This paper proposes DISCOVER, a goal selection strategy for sparse-reward long-horizon RL that simultaneously balances achievability, novelty, and relevance to construct curricula directed toward a target task. The authors theoretically prove that the number of steps to reach the goal scales linearly with goal distance rather than with the volume of the search space, and demonstrate significant improvements over prior state-of-the-art exploration strategies on high-dimensional navigation and manipulation tasks.
Distribution Learning Meets Graph Structure Sampling: This paper establishes a novel connection between PAC learning of high-dimensional probabilistic graphical models and efficient counting/sampling of graph structures. By reducing the maintenance of an exponentially large expert pool to a weighted DAG sampling problem via online learning frameworks (EWA/RWM), the paper presents the first efficient agnostic learning algorithm for Bayesian networks with chordal graph skeletons, and improves the sample complexity for tree-structured distributions from $O(nk^3/\varepsilon)$ to the optimal $O(nk^2/\varepsilon)$.
Dynamic Regret Reduces to Kernelized Static Regret: This paper reformulates dynamic regret minimization as a static regret problem in a reproducing kernel Hilbert space (RKHS), achieving the optimal path-length-dependent bound $\widetilde{\mathcal{O}}(\sqrt{MP_TT})$ via carefully designed shift-invariant kernels, without requiring prior knowledge of the time horizon.
Dynamics-Aligned Latent Imagination in Contextual World Models for Zero-Shot Generalization: DALI, a self-supervised context encoder, is introduced into the DreamerV3 architecture to infer latent environment parameters (e.g., gravity, friction) from interaction history. It achieves zero-shot generalization on cMDP benchmarks without retraining, outperforming ground-truth context-aware baselines by up to 96.4% on extrapolation tasks.
EgoBridge: Domain Adaptation for Generalizable Imitation from Egocentric Human Data: This paper proposes EgoBridge, a framework that uses Optimal Transport (OT) to align the joint distribution (features + actions) of human and robot data in a shared policy latent space, combined with Dynamic Time Warping (DTW) to construct pseudo-pairs, enabling cross-embodiment knowledge transfer from egocentric human data to robots, achieving up to 44% absolute improvement in success rate on real-world tasks.
Emergent World Beliefs: Exploring Transformers in Stochastic Games: This work extends the study of emergent world models in LLMs from perfect-information games (Othello, Chess) to the partial-information setting (Texas Hold'em Poker). By pre-training GPT-2 on PHH-format poker data and probing its internal activations, the paper demonstrates that the model not only learns deterministic features (hand rank recognition at ~98% accuracy) but also spontaneously develops internal representations of stochastic features (win rate/equity, correlation coefficient 0.59).
Empirical Study on Robustness and Resilience in Cooperative Multi-Agent Reinforcement Learning: Through 82,620 large-scale experiments, this work systematically investigates robustness and resilience in cooperative multi-agent RL, demonstrating that hyperparameter tuning matters more than algorithm selection, and revealing that commonly adopted practices such as parameter sharing, GAE, and PopArt are harmful under uncertainty. A set of practical hyperparameter recommendations is proposed.
Enhancing Interpretability in Deep Reinforcement Learning through Semantic Clustering: This paper proposes a Semantic Clustering Module (SCM) that combines a Feature Dimensionality Reduction (FDR) network with an adapted online VQ-VAE clustering mechanism, seamlessly integrated into the DRL training pipeline. The approach addresses the instability of t-SNE visualization and demonstrates that DRL inherently exhibits dynamic, semantics-based clustering behavior.
Exploration with Foundation Models: Capabilities, Limitations, and Hybrid Approaches: This paper systematically evaluates the zero-shot exploration capabilities of LLMs/VLMs on classic RL exploration tasks (bandits, Gridworld, Atari), identifies a knowing-doing gap in VLMs — where high-level reasoning succeeds but low-level control fails — and proposes a simple VLM-RL hybrid framework that substantially accelerates learning under idealized conditions.
Extending NGU to Multi-Agent RL: A Preliminary Study: This paper extends the single-agent NGU (Never Give Up) algorithm to multi-agent settings and conducts a systematic ablation across three design dimensions: shared replay buffer, shared novelty signal, and heterogeneous β parameters. The results show that NGU combined with a shared experience replay buffer significantly outperforms a multi-agent DQN baseline on the PettingZoo simple_tag pursuit task.
FedRAIN-Lite: Federated Reinforcement Algorithms for Improving Idealised Numerical Weather and Climate Models: This paper proposes FedRAIN-Lite, a federated reinforcement learning framework that assigns RL agents to individual latitude bands to learn local climate parameterization policies with periodic global aggregation. Evaluated on a hierarchical idealized energy balance model (EBM), DDPG with this framework reduces area-weighted RMSE by over 50% in tropical and mid-latitude regions, providing a viable pathway for scaling RL to full-scale GCMs.
Feel-Good Thompson Sampling for Contextual Bandits: a Markov Chain Monte Carlo Showdown: This paper presents the first systematic empirical evaluation of Feel-Good Thompson Sampling (FG-TS) and its smoothed variant SFG-TS under approximate posteriors, spanning linear, logistic, and neural contextual bandit settings across fourteen benchmarks. The study finds that FG-TS outperforms standard TS when exact posteriors are available (linear/logistic), but degrades in neural bandits, revealing a critical trade-off between optimistic bias and sampling noise.
Financial Instruction Following Evaluation (FIFE): FIFE is a challenging instruction-following benchmark for financial analysis tasks, comprising 88 manually authored complex prompts and 40+ chainable, domain-specific verifiable constraints. It evaluates 53 models under both strict and loose modes, revealing that even the strongest open-weight model (76.1% strict) fails to perfectly follow complex financial instruction requirements.
Finite-Sample Analysis of Policy Evaluation for Robust Average Reward Reinforcement Learning: This work provides the first finite-sample complexity analysis for policy evaluation in robust average reward MDPs. By constructing a carefully designed semi-norm, it proves that the robust Bellman operator is a contraction, and combines this with a truncated Multi-Level Monte Carlo (MLMC) estimator to achieve finite expected sample complexity, ultimately attaining an order-optimal sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$.
Forecasting in Offline Reinforcement Learning for Non-stationary Environments: This paper proposes Forl, a framework that fuses multimodal candidate states generated by a conditional diffusion model with shift predictions from a zero-shot time-series foundation model via Dimension-wise Closest Matching (DCM). Forl enables deployment-time adaptation to non-stationary observation functions that shift episodically, without retraining, achieving substantial average score improvements on D4RL benchmarks.
Foundation Models as World Models: A Foundational Study in Text-Based GridWorlds: This paper systematically evaluates foundation models (LLMs) as zero-shot world models (FWM) and direct decision-making agents (FA) in text-based gridworlds, revealing complementary advantages of the two strategies in deterministic and stochastic environments.
Gaussian Process Upper Confidence Bound Achieves Nearly-Optimal Regret in Noise-Free Gaussian Process Bandits: This paper proves that GP-UCB achieves nearly-optimal regret in the noise-free GP bandit problem, establishing for the first time constant $O(1)$ cumulative regret under both the SE kernel and the Matérn kernel (the latter when $d < \nu$), thereby closing a long-standing gap between the theory and practice of GP-UCB.
Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update: This paper proposes the GLB-OMD algorithm, which, for the first time in the generalized linear bandit (GLB) setting, simultaneously achieves a near-optimal regret bound of $\mathcal{O}(\log T\sqrt{T/\kappa_*})$ and $\mathcal{O}(1)$ per-round time and space complexity. The key technical contribution is constructing tight confidence sets for an online mirror descent (OMD) estimator via mix loss.
Generalizing Verifiable Instruction Following: This paper introduces IFBench, a benchmark for evaluating generalization in precise instruction following, demonstrating that current SOTA models severely overfit to the 25 constraint templates of IFEval. It further proposes IF-RLVR, a training method based on GRPO with verifiable rewards, which significantly improves both in-domain and out-of-domain instruction following performance.
Global Convergence for Average Reward Constrained MDPs with Primal-Dual Actor-Critic: This paper proposes the Primal-Dual Natural Actor-Critic (PDNAC) algorithm, which achieves, for the first time, a global convergence rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ and a constraint violation rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ for average reward constrained MDPs under general parameterized policies, matching the theoretical lower bound.
Gradient-Variation Online Adaptivity for Accelerated Optimization with Hölder Smoothness: This paper develops gradient-variation adaptive online learning algorithms for Hölder smooth function classes, achieving regret that smoothly interpolates between the smooth and non-smooth extremes. Via online-to-batch conversion, it provides the first universal method for strongly convex optimization that attains accelerated convergence in the smooth case and near-optimal convergence in the non-smooth case.
GraphChain: Large Language Models for Large-scale Graph Analysis via Tool Chaining: This paper proposes GraphChain, a framework that enables LLMs to analyze large-scale graphs in a progressive, human-like exploratory manner through two key components: progressive graph distillation (RL-driven tool-chain sequence generation) and structure-aware test-time adaptation (lightweight adapters conditioned on graph topology fingerprints). GraphChain achieves an average accuracy of 84.7%, surpassing the best baseline by 20.7%, and scales to graphs with up to 200,000 nodes.
Greedy Algorithm for Structured Bandits: A Sharp Characterization of Asymptotic Success / Failure: This paper provides a complete theoretical characterization of the greedy algorithm in structured bandit problems, proposing self-identifiability as a necessary and sufficient condition for the greedy algorithm to achieve sublinear regret, and extends the results to contextual bandits and the general interactive decision-making framework DMSO.
Horizon Reduction Makes RL Scalable: Through large-scale experiments involving up to one billion transitions, this paper identifies the curse of horizon—excessively long decision horizons—as the primary scalability bottleneck in offline RL, and demonstrates that horizon reduction techniques such as n-step returns and hierarchical policies substantially improve scalability. Building on this analysis, the paper proposes SHARSA, a simple yet effective method.
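For readers unfamiliar with n-step returns, one of the horizon reduction techniques named in the entry above, the following sketch computes the standard n-step bootstrapped target $\sum_{i=0}^{n-1}\gamma^i r_{t+i} + \gamma^n V(s_{t+n})$. It illustrates the generic technique only and is not SHARSA; the function name and argument layout are assumptions for the example.

```python
def n_step_target(rewards, bootstrap_value, gamma, n):
    """Generic n-step TD target (hypothetical helper, not the paper's code).

    rewards: the rewards r_t, ..., r_{t+n-1} (at least n entries)
    bootstrap_value: value estimate at the state reached after n steps
    gamma: discount factor
    """
    if len(rewards) < n:
        raise ValueError("need at least n rewards")
    target = 0.0
    for i in range(n):
        target += (gamma ** i) * rewards[i]
    # Bootstrapping after n steps shortens the effective backup chain by n.
    return target + (gamma ** n) * bootstrap_value


one_step = n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5, n=1)
three_step = n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5, n=3)
```

With larger `n`, each update propagates value information across more environment steps, so fewer recursive backups are needed to cover a long horizon.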
Human-Inspired Multi-Level Reinforcement Learning: This paper proposes RbRL-KL, which augments rating-based RL (RbRL) with a KL divergence-driven policy loss term. By leveraging failure experiences across different rating levels with varying weights to repel the current policy, RbRL-KL outperforms standard RbRL across 6 DeepMind Control environments.
Hybrid Latent Reasoning via Reinforcement Learning: HRPO proposes a hybrid latent reasoning policy optimization framework: a learnable gating mechanism progressively blends the hidden state representation from the previous step into the sampled token embeddings, enabling LLMs to leverage both discrete tokens and continuous latent representations during inference. Without requiring CoT annotations, HRPO is trained entirely via RL and outperforms baselines such as PPO and GRPO on both knowledge-intensive and STEM reasoning tasks.
Improved Regret and Contextual Linear Extension for Pandora's Box and Prophet Inequality: This paper proposes new algorithms for the online Pandora's Box problem, improving regret from $\widetilde{O}(n\sqrt{T})$ to $\widetilde{O}(\sqrt{nT})$ (matching the lower bound), and introduces the first contextual linear extension achieving $\widetilde{O}(nd\sqrt{T})$ regret.
Improved Regret Bounds for GP-UCB in Bayesian Optimization: This paper proves that GP-UCB achieves $\widetilde{O}(\sqrt{T})$ high-probability regret under the Bayesian setting (when the Matérn kernel satisfies a smoothness condition) and $O(\sqrt{T \ln^2 T})$ for the SE kernel, closing the gap between existing upper bounds for GP-UCB and the optimal upper bounds.
Improving Planning and MBRL with Temporally-Extended Actions: This paper proposes treating action duration as an additional optimization variable in shooting-based planning and MBRL, combined with a multi-armed bandit (MAB) mechanism for automatic duration range selection. The approach significantly accelerates planning across multiple environments and solves challenging tasks that standard methods fail to handle.
Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning: This work models multiple components of a complex RAG pipeline (Query Rewriter, Selector, Generator) as a cooperative multi-agent system and jointly optimizes them via MAPPO, using the F1 score of the final answer as a shared reward. The proposed method outperforms existing single-module optimization approaches on multiple QA benchmarks.
Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models: This paper proposes RAIF, which employs RL with rule-centric rewards to cultivate deep reasoning capabilities in LLMs for complex instructions containing And/Chain/Selection/Nested compositional constraints. A key finding is that vanilla CoT is detrimental to instruction following, as LLMs tend to shallowly paraphrase instructions rather than analyze constraint structures. RAIF addresses this through superior CoT enforcement (sample-level contrastive filtering of ineffective reasoning) and behavior cloning to control distribution shift. A 1.5B model trained with RAIF matches 8B-level performance, achieving an average improvement of 11.74% across 7 benchmarks.
Incremental Sequence Classification with Temporal Consistency: This paper imports the temporal-difference (TD) learning idea from reinforcement learning into sequence classification, proposing the TC-$\lambda$ loss function. By requiring the predictive distributions at adjacent time steps to satisfy a temporal consistency condition, it trains incremental sequence classifiers that outperform standard cross-entropy methods on both text classification and LLM verification tasks.
Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI Coordination: Inspired by Vygotsky's theory of inner speech, this paper proposes MIMIC, a framework that uses language as an intermediate representation between perception and action. A VLM provides language scaffolding to train a CVAE that generates inner speech, which then conditions a diffusion policy to produce diverse and steerable behaviors.
Interactive and Hybrid Imitation Learning: Provably Beating Behavior Cloning: When annotation cost is measured per state rather than per trajectory, the interactive method Stagger is proven to surpass Behavior Cloning under the $\mu$-recoverability condition (suboptimality $O(\mu H \log B / N)$ vs. $O(RH \log B / CN)$, with significant advantage when $\mu \ll R$). The paper further proposes a hybrid IL algorithm, Warm-Stagger, which combines offline data with interactive annotation to achieve strictly complementary advantages from both data sources on specific MDPs.
Inverse Optimization Latent Variable Models for Learning Costs Applied to Route Problems: This paper proposes IO-LVM (Inverse Optimization Latent Variable Model), which employs a VAE-style encoder to map observed COP solutions into a latent cost space. A Fenchel-Young loss combined with black-box solvers (Dijkstra/TSP solver) ensures feasibility at the decoding stage. The model learns the distribution of cost functions from route data without agent labels, and successfully separates navigation preferences of different agents in an unsupervised manner.
Kimina Lean Server: A High-Performance Lean Server for Large-Scale Verification: This paper presents Kimina Lean Server — a high-performance Lean 4 verification server designed for large-scale reinforcement learning training. By leveraging server-side parallelization and an LRU caching mechanism, it achieves 1.5–2× speedups over existing tools and has been used to train the state-of-the-art theorem proving model Kimina-Prover.
Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering: This paper proposes Wiki-PRF, a three-stage (Processing–Retrieval–Filtering) multimodal RAG framework that trains a VLM via reinforcement learning to autonomously invoke visual tools and filter retrieved results, achieving state-of-the-art performance on E-VQA and InfoSeek.
Last Iterate Convergence in Monotone Mean Field Games: This paper proposes a KL-divergence-based proximal point (PP) method that achieves asymptotic last iterate convergence (LIC) in non-strictly monotone mean field games (MFGs), and proves that regularized mirror descent (RMD) converges to regularized equilibria at an exponential rate. The combined approximate proximal point (APP) algorithm reliably converges to non-regularized equilibria on standard benchmarks.
Learning from Demonstrations via Capability-Aware Goal Sampling: This paper proposes Cago, a method that dynamically tracks an agent's attainment capability along expert demonstration trajectories and adaptively samples intermediate goals near the capability frontier, constructing an implicit curriculum to guide learning in long-horizon, sparse-reward tasks.
Learning Human-Like RL Agents through Trajectory Optimization with Action Quantization: This paper proposes MAQ (Motion-Action Quantization), a method that discretizes human actions into a finite set of motion primitives via VQ-VAE, then performs trajectory optimization within the quantized action space to train RL agents whose behavioral patterns more closely resemble those of humans.
Learning in Stackelberg Mean Field Games: A Non-Asymptotic Analysis: This paper proposes AC-SMFG, the first single-loop Actor-Critic algorithm with non-asymptotic convergence guarantees for solving Stackelberg Mean Field Games (SMFGs), achieving a convergence rate of $\widetilde{\mathcal{O}}(k^{-1/2})$.
Learning Interactive World Model for Object-Centric Reinforcement Learning: This paper proposes FIOC-WM, which learns the interaction structure among objects in a world model via a two-level factorization at the object and attribute levels. It trains a hierarchical policy grounded in interaction primitives, achieving more efficient policy learning and compositional generalization across multiple robot control tasks.
Learning Interestingness in Automated Mathematical Theory Formation: This paper proposes Fermat—a reinforcement learning environment that models mathematical theory formation as an MDP—and EvoAbstract, an LLM-driven evolutionary algorithm with abstraction learning, for automatically synthesizing interestingness metrics for mathematical objects. The approach substantially outperforms hard-coded baselines in elementary number theory and finite fields.
Learning Intractable Multimodal Policies with Reparameterization and Diversity Regularization: This paper proposes the Diversity-regularized Actor Critic (DrAC) algorithm, which unifies intractable multimodal policies (amortized actor and diffusion actor) under a stochastic-mapping formulation, enables direct policy gradient optimization via reparameterization without requiring probability density evaluation, and introduces a distance-based diversity regularization as an alternative to entropy regularization. DrAC demonstrates significant advantages on diversity-critical tasks such as multi-goal navigation and generative RL.
Learning Memory-Enhanced Improvement Heuristics for Flexible Job Shop Scheduling: This paper proposes MIStar—the first deep reinforcement learning (DRL)-based improvement heuristic framework for the Flexible Job Shop Scheduling Problem (FJSP). Key innovations include a directed heterogeneous disjunctive graph representation, a Memory-enhanced Heterogeneous Graph Neural Network (MHGNN), and a parallel greedy search strategy. MIStar consistently outperforms handcrafted improvement heuristics and state-of-the-art constructive DRL methods on both synthetic datasets and public benchmarks.
Learning to Clean: Reinforcement Learning for Noisy Label Correction: This paper formulates noisy label correction as a Markov Decision Process under the reinforcement learning framework, proposing RLNLC. A policy function built upon a k-nearest-neighbor embedding space determines which labels should be corrected, guided by a label consistency reward and a cross-subset alignment reward. RLNLC achieves state-of-the-art performance across multiple benchmark datasets under both instance-dependent and symmetric noise settings.
Learning to Focus: Prioritizing Informative Histories with Structured Attention Mechanisms in Partially Observable Reinforcement Learning: Two structured temporal priors—Memory-Length Prior and Gaussian Distributional Prior—are embedded into the self-attention mechanism of a Transformer world model. Under partially observable RL settings, Gaussian Attention achieves a 77% relative improvement in human-normalized score over UniZero on the Atari 100k benchmark with negligible computational overhead.
Massively Parallel Imitation Learning of Mouse Forelimb Musculoskeletal Reaching Dynamics: This work presents MIMIC-MJX, a massively parallel imitation learning pipeline for mouse forelimb musculoskeletal simulation. Leveraging JAX-accelerated PPO at 1.2 million steps/second across thousands of parallel environments, the pipeline trains physically-informed imitation learning policies. The study demonstrates that control cost regularization enables simulated muscle activity to better predict real EMG signals, and employs a Takens-theorem-based nonlinear dynamical systems approach to predict muscle activation from joint kinematics.
Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning: This paper proposes the SUBSAMPLE-MFQ algorithm, which randomly samples $k$ agents from $n$ to perform mean-field Q-learning, reducing the sample complexity of multi-agent reinforcement learning from $\text{poly}(n)$ to $\text{poly}(k)$. The resulting optimality gap is only $\tilde{O}(1/\sqrt{k})$ (independent of $n$), achieving exponential speedup over standard mean-field MARL when $k = O(\log n)$.
Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning: This paper proposes Memo, a Transformer-based memory-augmented framework that periodically generates summary tokens to compress historical context. Memo matches or exceeds the performance of full-context Transformers while reducing the KV cache at inference time by 8–10×, and demonstrates superior generalization to long contexts as well as robustness under streaming inference.
Meta-World+: An Improved, Standardized, RL Benchmark: This paper systematically exposes how undocumented reward function inconsistencies across versions of the Meta-World benchmark distort algorithm comparisons, and releases a standardized new version, Meta-World+, which explicitly retains both V1 and V2 reward functions, introduces MT25/ML25 task sets, upgrades to the Gymnasium API, and enables fully reproducible evaluation for multi-task and meta-reinforcement learning.
MetaBox-v2: A Unified Benchmark Platform for Meta-Black-Box Optimization: MetaBox-v2 is a milestone upgrade to the Meta-Black-Box Optimization (MetaBBO) benchmark platform. It provides unified support for four learning paradigms (RL/SL/NE/ICL), reproduces 23 baseline algorithms, integrates 18 test suites (1900+ problem instances), and achieves 10–40× speedup via vectorized environments and distributed evaluation.
Mind the GAP! The Challenges of Scale in Pixel-based Deep Reinforcement Learning: This paper identifies the "bottleneck connection" between the encoder (convolutional layers $\phi$) and the fully connected layers ($\psi$) as the fundamental obstacle to scaling pixel-based deep RL networks, and proposes Global Average Pooling (GAP) — a minimal architectural change — to directly resolve this bottleneck. GAP achieves performance on par with or superior to complex methods (SoftMoE, sparse training) at substantially lower computational cost.
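The effect of Global Average Pooling on the encoder-to-dense connection described in the entry above can be seen in a small sketch. The nested-list feature map and the helper names are illustrative assumptions, not the paper's architecture: flattening a `C x H x W` feature map feeds `C*H*W` values to the dense layers, while GAP feeds only `C`.

```python
def flatten(feature_map):
    """Baseline: concatenate every spatial location of every channel."""
    return [v for channel in feature_map for row in channel for v in row]


def global_average_pool(feature_map):
    """Average each channel over its H x W spatial grid (one value per channel)."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Toy encoder output with C=2 channels and a 2x2 spatial grid.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
flat = flatten(features)                # 8 inputs to the dense layers
pooled = global_average_pool(features)  # 2 inputs to the dense layers
```

The input width of the first dense layer then depends only on the channel count, not on the spatial resolution of the encoder output.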
Mixing Expert Knowledge: Bring Human Thoughts Back to the Game of Go: This paper proposes LoGos, which mixes Go expert data with general long chain-of-thought (CoT) reasoning data for cold-start fine-tuning, followed by GRPO reinforcement learning, enabling a general-purpose LLM to reach professional-level Go performance while preserving strong general reasoning capabilities.
Models That Prove Their Own Correctness: This paper proposes the Self-Proving Models framework, in which a model proves the correctness of its outputs to a verifier algorithm via an interactive proof system. Two learning algorithms are introduced—Transcript Learning (TL) and Reinforcement Learning from Verifier Feedback (RLVF)—and experiments on the GCD computation task demonstrate that Annotated TL achieves 96% Verifiability.
Modulation of Temporal Decision-Making in a Deep Reinforcement Learning Agent under the Dual-Task Paradigm: DRL agents trained in a simplified Overcooked environment to perform either a single task (temporal production) or a dual task (temporal production + numerical comparison) exhibit significantly greater temporal overproduction across all four target durations in the dual-task condition—an emergent behavior that closely parallels the time overestimation phenomenon observed in human temporal perception research under dual-task paradigms.
MTL-KD: Multi-Task Learning Via Knowledge Distillation for Generalizable Neural Vehicle Routing Solver: This paper proposes MTL-KD, a multi-task learning framework based on knowledge distillation. It distills policy knowledge from multiple RL single-task teacher models into a heavy-decoder student model, achieving efficient unified solving across diverse VRP variants with superior generalization on large-scale instances.
Multi-Agent Collaboration via Evolving Orchestration: This paper proposes a "Puppeteer" multi-agent collaboration paradigm in which a centralized orchestrator learns via RL to dynamically select which agent to activate at each reasoning step. The approach simultaneously improves performance and efficiency on both closed-domain and open-domain tasks, and reveals that evolved topologies tend toward more compact cyclic structures.
Multi-Objective Reinforcement Learning with Max-Min Criterion: A Game-Theoretic Approach: This work reformulates entropy-regularized max-min multi-objective reinforcement learning as a two-player zero-sum regularized game, proposes the ERAM/ARAM algorithms with closed-form weight updates via mirror descent, and establishes global last-iterate convergence, substantially outperforming baselines across multiple MORL benchmarks.
Near-Optimal Quantum Algorithms for Computing (Coarse) Correlated Equilibria of General-Sum Games: This work presents the first quantum algorithms for computing correlated equilibria (CE) and coarse correlated equilibria (CCE) in multi-player general-sum games. By quantizing the multi-scale MWU framework and introducing a unified QRAM scheme, the paper achieves a near-optimal query complexity of $\tilde{O}(m\sqrt{n})$ in both the number of players $m$ and actions $n$, along with matching quantum lower bounds.
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation: This paper proposes NoisyRollout, a data augmentation method with zero additional training cost. During GRPO-based VLM training, it mixes rollouts from clean and moderately perturbed images to enhance policy exploration diversity. Using only 2.1K samples, it achieves state-of-the-art performance among open-source RL fine-tuned models across five out-of-domain benchmarks.
Non-convex Entropic Mean-Field Optimization via Best Response Flow: This work extends Best Response Flow from convex functional optimization to the non-convex setting, proving that under sufficiently large entropic regularization the BR operator becomes a contraction in the $L^1$-Wasserstein distance, thereby guaranteeing the existence of a unique global minimizer and exponential convergence for non-convex objectives.
On the Global Optimality of Policy Gradient Methods in General Utility Reinforcement Learning: This paper establishes global optimality guarantees for policy gradient methods in reinforcement learning with general utilities (RLGU): in the tabular setting, global convergence is proved via a novel gradient dominance inequality; in large-scale state-action spaces, an occupancy measure approximation algorithm PG-OMA based on maximum likelihood estimation (MLE) is proposed, whose sample complexity depends only on the dimension $m$ of the function approximation class rather than the size of the state-action space.
Online Optimization for Offline Safe Reinforcement Learning: This paper proposes O3SRL, a framework that formalizes offline safe reinforcement learning as a minimax optimization problem. By combining an offline RL oracle with EXP3-based online optimization for adaptive Lagrange multiplier adjustment, O3SRL avoids unstable off-policy evaluation and achieves high reward under strict safety constraints.
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning: Open Vision Reasoner (OVR) employs a two-stage training paradigm—linguistic cold start followed by large-scale multimodal RL—to effectively transfer cognitive behaviors (e.g., backtracking, verification) from language models to visual reasoning. Built on Qwen2.5-VL-7B, OVR achieves 51.8% on MathVision, the first model at this scale to surpass 50%, establishing a new state of the art among same-scale models.
Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning: This paper proposes UEPO, a framework comprising three core components—multi-seed dynamics-aware diffusion policies, dynamic divergence regularization, and diffusion-based data augmentation—to address insufficient multimodal behavioral coverage and distribution shift in offline-to-online reinforcement learning, surpassing Uni-O4 on the D4RL benchmark.
Optimizing the Unknown: Black Box Bayesian Optimization with Energy-Based Model and Reinforcement Learning: This paper proposes REBMBO, a framework that unifies Gaussian Processes (local modeling), Energy-Based Models (EBM, global exploration), and PPO-based reinforcement learning (multi-step look-ahead) into a closed-loop Bayesian optimization system, achieving significant improvements over conventional BO methods on high-dimensional and multi-modal black-box optimization tasks.
Oryx: a Scalable Sequence Model for Many-Agent Coordination in Offline MARL: This paper proposes Oryx, a scalable sequence model algorithm for offline cooperative MARL that integrates the Retention-based Sable architecture with an autoregressive formulation of ICQ offline regularization. Through a dual-decoder that jointly outputs policies and Q-values, combined with counterfactual advantage estimation, Oryx achieves state-of-the-art performance on more than 80% of 65 datasets and demonstrates robust scalability to 50-agent scenarios.
Parameter-Free Algorithms for the Stochastically Extended Adversarial Model: This work presents the first parameter-free algorithms for the Stochastically Extended Adversarial (SEA) model, which bridges adversarial and stochastic online convex optimization. Without prior knowledge of the domain diameter $D$ and/or the Lipschitz constant $G$, the proposed algorithms—built upon Optimistic Online Newton Step (OONS)—achieve regret bounds comparable to those of parameter-aware methods.
Parameter Efficient Fine-tuning via Explained Variance Adaptation: This paper proposes Explained Variance Adaptation (EVA), which initializes LoRA matrices via incremental SVD on activation vectors from downstream data, provably maximizing the expected gradient signal. Combined with an adaptive rank allocation mechanism, EVA establishes a new accuracy–efficiency Pareto frontier across language generation/understanding, image classification, and reinforcement learning.
PARCO: Parallel AutoRegressive Models for Multi-Agent Combinatorial Optimization: PARCO is a framework that solves multi-agent combinatorial optimization problems efficiently via Communication Layers for inter-agent coordination, a Multiple Pointer Mechanism for parallel decoding, and a Priority-based Conflict Handler for conflict resolution.
Periodic Skill Discovery: This paper proposes Periodic Skill Discovery (PSD), a framework that maps states onto a circular latent space to naturally encode periodicity, enabling unsupervised discovery of diverse locomotion skills with varying periods.
Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options: This paper proposes the M-AUPO algorithm for preference-based reinforcement learning, leveraging the Plackett-Luce ranking model to handle multi-option comparison feedback, and provides the first theoretical proof that larger subset sizes directly improve sample efficiency.
Prompt Tuning Decision Transformers with Structured and Scalable Bandits: This paper proposes a structured prompt tuning method based on multi-armed bandits. By decomposing prompts into independent segments and leveraging a pretrained PDT as a feature extractor, the method reduces prompt search complexity from combinatorial explosion to linear scale, significantly improving inference performance of a frozen PDT backbone in multi-task offline RL.
Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents: This paper proposes AcTOL, which learns ordered and continuous vision-language representations via a visual-language ordering loss and a Brownian bridge constraint, without relying on rigid goal-reaching assumptions, achieving significant improvements on downstream simulated and real-world robot manipulation tasks.
Quantifying Generalisation in Imitation Learning: This paper proposes the Labyrinth benchmark environment, which achieves strict separation between training and evaluation data through controllable maze structure variations. It reveals severe deficiencies in the structural generalisation of current imitation learning methods (best method achieves only 5% success rate on the test set) and provides a systematic tool for evaluating generalisation in imitation learning.
Real-World Reinforcement Learning of Active Perception Behaviors: This paper proposes Asymmetric Advantage-Weighted Regression (AAWR), which leverages additional privileged sensors during training to estimate more accurate advantage functions, enabling efficient learning of active perception policies in the real world. AAWR outperforms all baselines across 8 manipulation tasks spanning varying degrees of partial observability.
Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards: This work releases Reasoning Gym, a library of 100+ procedurally generated reasoning tasks spanning algebra, arithmetic, algorithms, logic, geometry, graph theory, games, and more. Each task supports infinite data generation and parameterized difficulty control. Experiments demonstrate that RLVR training achieves significant skill transfer both within and across domains, and improves performance on external benchmarks such as MATH and GSM8K.
Reinforcement Learning for Long-Horizon Multi-Turn Search Agents: This paper demonstrates that a 14B-parameter search agent trained with RL can surpass frontier models on legal document retrieval (85% vs. GPT o3's 81%) through multi-turn interaction, enabled by a carefully designed segmented reward structure and a sufficiently long interaction horizon.
Reinforcement Learning Teachers of Test Time Scaling: This paper proposes the Reinforcement Learning Teacher (RLT) framework, which provides both the problem and the answer to a teacher model and trains it to generate effective explanatory reasoning chains rather than solving problems from scratch. This enables a 7B-parameter teacher to produce distillation data superior to that generated by models orders of magnitude larger.
Reinforcement Learning with Action Chunking: This paper proposes Q-chunking, which extends action chunking from imitation learning to TD-based reinforcement learning by running RL directly over a "chunked" action space, thereby improving exploration and sample efficiency in long-horizon sparse-reward tasks.
RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models: This paper proposes RePIC, the first reinforcement learning-based post-training framework for multimodal large language models targeting personalized image captioning, which significantly outperforms SFT-based methods in multi-concept scenarios.
Retrosynthesis Planning via Worst-path Policy Optimisation in Tree-structured MDPs: This work reformulates retrosynthesis planning as a worst-path optimisation problem in tree-structured MDPs — the value of a synthesis tree is determined by its weakest path, since any dead-end path renders the entire tree invalid. The proposed method, InterRetro, optimises this worst-path objective via weighted self-imitation learning, achieving 100% success rate on Retro*-190, reducing path length by 4.9%, and attaining 92% of full performance with only 10% of training data.
Reward-Aware Proto-Representations in Reinforcement Learning: This paper systematically develops the theoretical foundations of the Default Representation (DR)—deriving DP and TD learning algorithms, analyzing the feature space structure, and proposing default features for function approximation—and demonstrates DR's reward-aware advantages over the Successor Representation (SR) across four settings: reward shaping, option discovery, exploration, and transfer learning.
Risk-Averse Constrained Reinforcement Learning with Optimized Certainty Equivalents: This paper proposes a reward-based risk-aware constrained RL framework that applies Optimized Certainty Equivalent (OCE) risk measures to both objectives and constraints, establishes parametric strong duality, and delivers a modular algorithm that wraps standard RL solvers (e.g., PPO) as a black box.
Risk-Averse Total-Reward Reinforcement Learning: This paper proposes risk-averse Q-learning algorithms (ERM-TRC and EVaR-TRC) for the undiscounted total-reward criterion (TRC). By exploiting the elicitability of ERM, the Bellman operator is reformulated as a stochastic gradient descent objective, and convergence guarantees are established.
RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning: Tango is a framework that alternately trains a generator and a verifier via RL. The verifier is a generative process-level LLM that evaluates reasoning step by step in natural language; it is trained solely with outcome-level correctness rewards (no step-level annotations), and the two models reinforce each other through co-evolution. On 7B/8B-scale models, Tango achieves SOTA, with a 100% relative improvement over vanilla GRPO on AIME 2025.
Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics: Robot-R1 proposes training large vision-language models (LVLMs) via reinforcement learning (GRPO) for embodied reasoning. By casting next keystate prediction as multiple-choice questions and optimizing reasoning paths with RL, a 7B-parameter model surpasses GPT-4o on low-level control reasoning tasks.
Robust Adversarial Reinforcement Learning in Stochastic Games via Sequence Modeling: This paper proposes CART (Conservative Adversarially Robust Decision Transformer), the first method to enhance the adversarial robustness of Decision Transformers in stochastic games. By modeling stage games and estimating NashQ values, CART addresses the over-optimism of ARDT under stochastic state transitions, achieving more accurate minimax value estimation and superior worst-case returns.
Robust and Diverse Multi-Agent Learning via Rational Policy Gradient: This paper proposes the Rationality-Preserving Optimization (RPO) framework and the Rational Policy Gradient (RPG) algorithm. By introducing manipulator agents and opponent shaping techniques, RPG eliminates suicidal behavior induced by adversarial optimization in both cooperative and general-sum games, while simultaneously achieving policy robustness and diversity.
RoiRL: Efficient, Self-Supervised Reasoning with Offline Iterative Reinforcement Learning: This paper proposes RoiRL, a lightweight self-supervised reasoning framework based on offline iterative reinforcement learning. By replacing online RL (e.g., TTRL) with a weighted log-likelihood objective, RoiRL enables self-improvement of LLM reasoning capabilities without requiring a reference model or ground-truth labels, achieving 2.5× faster training with superior performance.
Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning: Router-R1 frames multi-LLM routing and aggregation as a sequential decision-making process, employing an LLM itself as the router to interleave think and route actions. Trained via PPO with a triple reward covering format, correctness, and cost, Router-R1 outperforms all router baselines across 7 QA benchmarks and generalizes to previously unseen LLMs.
Sample-Efficient Tabular Self-Play for Offline Robust Reinforcement Learning: This paper proposes the RTZ-VI-LCB algorithm for offline robust two-player zero-sum Markov games (RTZMG). By combining pessimistic robust value iteration with Bernstein-style penalties, it achieves a near-optimal sample complexity of $O(C_r^* \cdot H^4 \cdot S \cdot (A+B) / \varepsilon^2)$, significantly improving upon the prior best result of $O(H^5 \cdot S^2 \cdot AB / \varepsilon^2)$ in terms of dependence on both the state space and the action space.
Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning: This paper establishes the first finite-sample convergence guarantees for reinforcement learning in distributionally robust average-reward MDPs (DR-AMDPs), proposing two algorithms (discount reduction and anchoring) that achieve near-optimal sample complexity of $\widetilde{O}(|S||A|t_{\mathrm{mix}}^2\varepsilon^{-2})$ under both KL and $f_k$-divergence uncertainty sets.
Scalable Neural Incentive Design with Parameterized Mean-Field Approximation: This paper proposes the AMID algorithm, which formalizes the multi-agent incentive design (ID) problem as a parameterized mean-field game (PMFG), proves that the finite-$N$-agent objective approximates the infinite-population limit at a rate of $\mathscr{O}(1/\sqrt{N})$, and achieves substantial revenue improvements across multiple auction settings.
Scalable Policy-Based RL Algorithms for POMDPs: This paper proposes approximating POMDPs as finite-state Superstate MDPs (where states are truncated histories), derives a tighter upper bound on the optimal value function gap (decaying exponentially with history length), and provides the first finite-time convergence guarantees for standard TD learning combined with policy optimization under non-Markovian sampling.
Self-Improving Embodied Foundation Models: This paper proposes a two-stage post-training framework for embodied foundation models: Stage 1 performs supervised fine-tuning via behavior cloning and steps-to-go prediction; Stage 2 leverages the resulting self-reward function and success detector for online RL self-improvement. Using only 1–3% additional data, the method achieves over 1.5× improvement in success rate and, for the first time, demonstrates a robot autonomously acquiring novel skills beyond the distribution of imitation data.
Sequential Monte Carlo for Policy Optimization in Continuous POMDPs: This paper proposes a nested Sequential Monte Carlo (SMC) algorithm grounded in non-Markovian Feynman-Kac models for policy optimization in continuous POMDPs, naturally capturing the value of information gathering without hand-crafted heuristics.
Sequential Multi-Agent Dynamic Algorithm Configuration: This paper proposes Seq-MADAC, a framework that models multi-hyperparameter dynamic configuration as a contextual sequential multi-agent MDP. By exploiting inherent inter-parameter dependencies via a Sequential Advantage Decomposition Network (SADN), it outperforms existing MARL methods on multi-objective optimization algorithm configuration.
Shift Before You Learn: Enabling Low-Rank Representations in Reinforcement Learning: This paper reveals that the successor measure in reinforcement learning is not intrinsically approximately low-rank, but a "shifted successor measure"—obtained by skipping the first few transition steps—naturally exhibits low-rank structure. A novel Type II Poincaré inequality is introduced to quantify the required shift, providing finite-sample theoretical guarantees and practical improvements for goal-conditioned RL.
Simultaneous Swap Regret Minimization via KL-Calibration: This paper introduces KL-Calibration as a stronger calibration measure, establishes its equivalence to the swap regret of log loss, and achieves a simultaneous swap regret bound of $\tilde{\mathcal{O}}(T^{1/3})$ via non-uniform discretization and a novel randomized rounding scheme, covering a broader class of proper losses than prior work.
Solving Continuous Mean Field Games: Deep Reinforcement Learning for Non-Stationary Dynamics: This paper proposes the DEDA-FP algorithm, which for the first time simultaneously learns Nash equilibrium policies and population distributions in non-stationary mean field games (MFGs) with continuous state/action spaces. By combining deep RL for best response computation, supervised learning for mean policy representation, and conditional Normalizing Flow for modeling time-varying population distributions, DEDA-FP achieves over 10× greater sampling efficiency than existing methods.
Solving Neural Min-Max Games: The Role of Architecture, Initialization & Dynamics: This paper provides the first convergence guarantees for zero-sum games parameterized by two-layer neural networks, proving that under sufficient overparameterization, random Gaussian initialization, and alternating gradient descent-ascent (AltGDA), the dynamics converge to an $\epsilon$-approximate Nash equilibrium with high probability.
Spatial-Aware Decision-Making with Ring Attractors in Reinforcement Learning Systems: This paper integrates ring attractor models from neuroscience into action selection in deep reinforcement learning (DRL). By mapping actions to spatial positions on a ring and injecting Gaussian signals encoding Q-values and uncertainty, the proposed approach achieves a 53% improvement over baseline on Atari 100K.
STAIR: Addressing Stage Misalignment through Temporal-Aligned Preference Reinforcement Learning: This paper identifies and formalizes the "stage misalignment" problem in Preference-based Reinforcement Learning (PbRL)—wherein comparing behavior segments from different task stages produces uninformative feedback—and proposes STAIR, a method that learns temporal distances via contrastive learning to approximate stage discrepancy. By employing a quadrilateral distance metric for stage-aligned query selection, STAIR substantially outperforms existing PbRL methods on multi-stage tasks.
Strategic Costs of Perceived Bias in Fair Selection: This paper employs a game-theoretic model to reveal a "perception-driven bias" mechanism: in purely merit-based selection systems, inter-group differences in perceived post-selection value lead to rational effort disparities, thereby systematically propagating inequality within ostensibly fair processes.
Structural Information-based Hierarchical Diffusion for Offline Reinforcement Learning: This paper proposes SIHD, a framework that leverages structural information (structural entropy) extracted from historical trajectories to adaptively construct multi-scale diffusion hierarchies, replaces local reward prediction with structural information gain as the conditional guidance signal, and introduces structural entropy regularization to encourage exploration of sparse states in offline data. SIHD achieves up to 12.6% improvement in decision-making performance on the D4RL benchmark.
Structured Reinforcement Learning for Combinatorial Decision-Making: This paper proposes Structured Reinforcement Learning (SRL), which embeds a combinatorial optimization solver as a differentiable layer within the actor of an actor-critic framework. End-to-end gradient propagation is achieved via Fenchel-Young loss with Gaussian perturbations, enabling purely online learning without expert demonstrations. SRL matches imitation learning and outperforms unstructured RL by up to 92% across six industrial-scale combinatorial decision-making problems.
Succeed or Learn Slowly: Sample Efficient Off-Policy Reinforcement Learning for Mobile App Control: This paper proposes the SoLS algorithm, which achieves sample-efficient RL fine-tuning of foundation models for mobile app control through an asymmetric policy update mechanism (aggressive learning on success, conservative regularization on failure) combined with Success Transition Replay (STR), attaining a 51.3% success rate on AndroidWorld.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution: This work is the first to apply reinforcement learning (RL) to real-world software engineering tasks (GitHub PR/Issue resolution), training Llama-3.3-70B exclusively with a rule-based sequence-similarity reward. It achieves a 41.0% resolve rate on SWE-bench Verified (SOTA among medium-scale models). Notably, although RL training is conducted solely on issue-solving data, it elicits emergent generalization in out-of-domain tasks including code reasoning, mathematics, and general language understanding.
Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment: This paper formulates personalized dialogue alignment as a multi-turn Markov Decision Process and proposes the RLPA framework, enabling LLMs to dynamically infer and maintain user profiles through online interaction with simulated users, and to generate personalized responses accordingly.
Temporal-Difference Variational Continual Learning: This paper proposes the TD-VCL objective, which reformulates the learning target in Variational Continual Learning (VCL) as a weighted combination of multiple past posterior estimates. This reformulation reveals a deep connection to temporal-difference (TD) methods in reinforcement learning, and effectively mitigates the progressive accumulation of approximation errors by "spreading" regularization pressure across multiple historical posteriors.
TensorRL-QAS: Reinforcement Learning with Tensor Networks for Improved Quantum Architecture Search: This work proposes TensorRL-QAS, a framework that warm-starts reinforcement learning-based quantum architecture search (RL-QAS) using tensor networks (MPS/DMRG), achieving up to 10× reduction in circuit depth and CNOT gate count, and up to 98% acceleration in training time, thereby effectively addressing the scalability bottleneck of RL-QAS on large-scale quantum systems.
The Burden of Interactive Alignment with Inconsistent Preferences: This paper models user interactions with engagement-driven algorithms as a multi-leader single-follower Stackelberg game, establishing a critical planning-horizon threshold: users whose effective horizon exceeds this threshold can align the algorithm to their interests, while those below it are instead aligned to the algorithm's objectives. The paper further demonstrates that introducing low-cost signals (e.g., an extra click) can substantially reduce the burden of alignment.
The Path Not Taken: RLVR Provably Learns Off the Principals: This paper proposes the Three-Gate Theory to explain the apparent sparsity of parameter updates in RLVR, demonstrating that RLVR learns along off-principal directions in weight space — a fundamentally different optimization mechanism from SFT — and that directly transplanting SFT-era PEFT methods to RLVR is therefore flawed.
The Physical Basis of Prediction: World Model Formation in Neural Organoids via an LLM-Generated Curriculum: This paper proposes a framework for studying world model formation in human neural organoids, comprising three progressively complex virtual environments (conditioned avoidance, predator–prey, Pong) and a meta-learning approach in which an LLM automatically generates experimental protocols, complemented by a multi-scale biophysical evaluation strategy to quantify the physical basis of biological learning.
The World Is Bigger! A Computationally-Embedded Perspective on the Big World Hypothesis: This paper formalizes the Big World Hypothesis from a computationally-embedded perspective, proves that agents embedded in universal-local environments are inherently capacity-constrained, proposes interactivity as a computational measure of continual adaptability, and empirically demonstrates that deep nonlinear networks fail to maintain interactivity while deep linear networks improve interactivity as capacity increases.
Thompson Sampling for Multi-Objective Linear Contextual Bandit: This paper proposes MOL-TS—the first multi-objective linear contextual bandit Thompson Sampling algorithm with worst-case Pareto regret guarantees. By introducing the concept of "effective Pareto optimal arms" and an optimistic sampling strategy, MOL-TS achieves a regret upper bound of $\widetilde{O}(d^{3/2}\sqrt{T})$, with the number of objectives $L$ contributing only an $O(\log L)$ factor.
Thompson Sampling in Function Spaces via Neural Operators: This paper extends Thompson Sampling (TS) from finite-dimensional parameter spaces to infinite-dimensional function spaces, leveraging neural operators as approximate samplers of Gaussian process posteriors to efficiently solve functional optimization problems involving partial differential equations (PDEs).
Time Reversal Symmetry for Efficient Robotic Manipulations in Deep Reinforcement Learning: This paper proposes the TR-DRL framework, which exploits time reversal symmetry in robotic manipulation tasks—via trajectory reversal augmentation (for fully reversible transitions) and time-reversal-guided potential-based reward shaping (for partially reversible transitions)—to significantly improve sample efficiency and final performance of DRL on paired tasks (e.g., door opening/closing).
To Distill or Decide? Understanding the Algorithmic Trade-off in Partially Observable Reinforcement Learning: Using a theoretical framework (perturbed Block MDP) and controlled locomotion experiments, this paper systematically investigates the algorithmic trade-off between privileged expert distillation and standard RL (without privileged information) in partially observable RL, finding that the trade-off is primarily governed by the stochasticity of latent state dynamics.
Towards Provable Emergence of In-Context Reinforcement Learning: This paper theoretically proves that the globally optimal parameters of a Transformer pretrained via standard RL objectives can implement in-context temporal difference (TD) learning, providing the first provable theoretical foundation for the in-context RL (ICRL) phenomenon.
Tractable Multinomial Logit Contextual Bandits with Non-Linear Utilities: This work presents ONL-MNL, the first computationally tractable and statistically optimal algorithm for the MNL contextual bandit problem under non-linear utility functions (including neural networks), achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ regret without relying on NTK assumptions.
Training Language Models to Reason Efficiently: By incorporating a length penalty term into the RL reward—multiplying the correctness reward by $(1 - \alpha \cdot \sigma(\text{norm\_len}))$—and using a single hyperparameter $\alpha$ to control the token–accuracy trade-off curve, this work achieves a 50% reduction in token usage with less than 5% accuracy degradation on 7B reasoning models after only 100 RL training steps.
TRiCo: Triadic Game-Theoretic Co-Training for Robust Semi-Supervised Learning: This paper proposes TRiCo, a framework that reformulates semi-supervised learning as a three-player Stackelberg game among a teacher, two student classifiers, and an adversarial generator. It replaces confidence-based thresholding with mutual information for pseudo-label selection and employs a meta-learning teacher to adaptively regulate training dynamics, achieving state-of-the-art performance under low-label regimes.
Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm: This paper proposes the TRRO theoretical framework and the PIRO practical algorithm, which guarantee monotonic improvement of reward function updates in IRL via a Minorization-Maximization procedure, achieving stability guarantees analogous to those of TRPO/PPO in forward RL.
Variance-Aware Feel-Good Thompson Sampling for Contextual Bandits: This paper proposes FGTS-VA, the first variance-aware contextual bandit algorithm based on Feel-Good Thompson Sampling. The resulting regret bound is optimal in the model dimension $d$, matching the best variance-dependent regret bounds established by UCB-based methods.
VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning: This paper introduces VIKI-Bench, the first hierarchical benchmark for embodied multi-agent cooperation, comprising three evaluation levels—agent activation, task planning, and trajectory perception—and proposes VIKI-R, a two-stage training framework combining CoT-supervised fine-tuning with multi-level reward reinforcement learning. The framework achieves significant improvements over baselines across diverse robot morphologies and multi-view visual observations, with combinatorial coordination patterns emerging during the RL stage.
VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play: This paper presents VolleyBots, a multi-drone volleyball competition testbed that integrates cooperative-adversarial gameplay, turn-based interaction, and agile 3D motion control. Built on Isaac Sim, it establishes a task curriculum from single-agent training to multi-agent competition. A hierarchical policy achieves a 69.5% win rate on the 3v3 task, with demonstrated zero-shot sim-to-real transfer.
When Can Model-Free Reinforcement Learning be Enough for Thinking?: This paper proposes the Thought MDP formalism to characterize the conditions under which "thinking" behavior emerges under model-free RL: policy initialization is the decisive factor; thinking actions are equivalent to the agent performing one step of policy improvement before acting; and open-source LLMs satisfy the necessary conditions for thinking to emerge.
When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners: Inspired by cognitive neuroscience (the relative independence of reasoning and language processing in the human brain), this work identifies and removes language-specific components in the activation space of LLMs to disentangle language from reasoning, achieving consistent improvements in multilingual reasoning performance without any training.
Zero-Shot Context Generalization in Reinforcement Learning from Few Training Contexts: This paper proposes the Context-Enhanced Bellman Equation (CEBE) and Context Sample Enhancement (CSE), which leverage first-order derivative information of environment dynamics and reward functions with respect to context parameters to achieve zero-shot generalization to unseen contexts when training is restricted to a single context.
Zeroth-Order Optimization Finds Flat Minima: This paper provides the first theoretical proof that standard zeroth-order optimization (two-point gradient estimation) exhibits an implicit regularization effect—converging to flat minima that minimize the Hessian trace—with a convergence complexity of $T = \mathcal{O}(d^4/\epsilon^2)$ established under convexity and sufficient smoothness conditions.
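
The length-penalized reward quoted in the "Training Language Models to Reason Efficiently" entry above, correctness multiplied by $(1 - \alpha \cdot \sigma(\text{norm\_len}))$, can be illustrated with a minimal sketch. The entry does not define `norm_len`, so the code below assumes it is a z-score of response length computed over the correct responses sampled for the same prompt; the function name `length_penalized_rewards` and that normalization are illustrative assumptions, not the paper's implementation.

```python
import math
from statistics import mean, pstdev


def sigmoid(x: float) -> float:
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def length_penalized_rewards(correct, lengths, alpha):
    """Return one reward per sampled response for a single prompt.

    correct: list of bools (was the final answer right?)
    lengths: list of token counts, same order as `correct`
    alpha:   penalty strength; alpha = 0 recovers the plain 0/1 reward

    ASSUMPTION: norm_len is the z-score of a response's length relative to
    the correct responses in this group. The source entry does not specify it.
    """
    ref = [n for c, n in zip(correct, lengths) if c]
    if not ref:
        # No correct response: every reward is zero, nothing to normalize.
        return [0.0] * len(correct)

    mu = mean(ref)
    sd = pstdev(ref) or 1.0  # avoid division by zero when lengths coincide

    rewards = []
    for c, n in zip(correct, lengths):
        if not c:
            rewards.append(0.0)  # incorrect answers get no reward
            continue
        norm_len = (n - mu) / sd
        rewards.append(1.0 - alpha * sigmoid(norm_len))
    return rewards


if __name__ == "__main__":
    print(length_penalized_rewards([True, True, False], [100, 300, 500], alpha=0.2))
```

Under these assumptions, a shorter correct response receives a higher reward than a longer correct one, and raising `alpha` widens that gap, which is how a single hyperparameter can trade tokens against accuracy.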

🧩 Multimodal VLM

A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1: This paper proposes M-Attack, which performs random cropping on source images and aligns them with target images via local-global or local-local matching in the embedding space, combined with a multi-CLIP model ensemble. This causes adversarial perturbations to naturally concentrate on semantically critical regions, forming clear semantic details. M-Attack achieves >90% targeted attack success rate against commercial black-box LVLMs including GPT-4.5/4o/o1.
A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection: This work introduces the first multimodal framing analysis benchmark for oil and gas (O&G) industry video advertisements, comprising 706 videos, 13 framing categories, 50+ entities, and 20 countries. It systematically evaluates six VLMs on greenwashing-related framing detection, finding that GPT-4.1 achieves 79% F1 zero-shot on environmental labels but only 46% on green innovation, thereby exposing implicit framing analysis and cultural context understanding as core challenges for current VLMs.
ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking: This paper proposes ACT (Annotation with Critical Thinking), a data pipeline in which an MLLM annotates all samples in bulk, a second MLLM acting as a critic estimates the error probability of each annotation, and only high-suspicion samples are routed to human reviewers. Combined with a theoretically derived ACT loss function, the approach achieves 70–90% reduction in human annotation cost across six cross-modal datasets while maintaining a downstream performance gap of less than 2%.
AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining: This paper proposes AdaLRS, a plug-and-play online learning rate search algorithm that adjusts the learning rate adaptively by monitoring the loss descent velocity. It reduces learning rate hyperparameter search from multiple independent training runs to a single run, saving approximately 50% of training cost.
Adapting Vision-Language Models for Evaluating World Models: This paper proposes UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), a unified semantic evaluator for world model rollouts constructed by fine-tuning only the projection head of PaliGemma 2 (0.07% of total parameters). UNIVERSE achieves performance comparable to task-specific models on action recognition and character recognition, while exhibiting strong alignment with human judgments.
ADMN: A Layer-Wise Adaptive Multimodal Network for Dynamic Input Noise and Compute Resources: This paper proposes ADMN (Adaptive Depth Multimodal Network), a two-stage training framework: (1) Multimodal LayerDrop fine-tuning to make the backbone robust to arbitrary layer configurations, and (2) a QoI-aware controller that dynamically allocates layer budgets across modalities. ADMN adaptively assigns layers based on per-modality quality-of-information (QoI) under strict compute constraints, matching full-model accuracy while reducing FLOPs by 75% and latency by 60%.
Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning: This paper proposes CLIC, which concatenates two images to form a composite scene and generates hard negatives via cross-image lexical swapping, while constructing multiple positive captions to enhance semantic invariance. By fine-tuning only the CLIP text encoder, CLIC simultaneously improves compositional reasoning (achieving SOTA on SugarCrepe++) and downstream retrieval performance, resolving the long-standing trade-off between compositionality and retrieval in prior methods.
AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models: This paper introduces a fine-grained 3D embodied reasoning task—jointly predicting the spatial location, motion type, and motion axis of actionable elements—and proposes rendering 3D point clouds into panoramic views with projected affordance candidates, guided by a customized Chain-of-Thought (CoT) reasoning paradigm for MLLMs, achieving state-of-the-art performance with AP25 of 23.3%.
Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment: This paper proposes BACL (Boundary-Aware Curriculum with Local Attention), which combines a learnable boundary-aware negative sampler (via easy-to-hard curriculum learning) with a contrastive local attention loss (for token-level mismatch localization). On LAION-400M, BACL yields a +32% R@1 improvement over CLIP and achieves state-of-the-art results on four large-scale benchmarks.
AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making: This paper inverts the conventional instruction grounding paradigm — rather than compressing VLM knowledge into intermediate representations (symbolic skills or constraints), it renders candidate robot trajectories into multi-view scene images and evaluates action proposals directly within the VLM's native high-dimensional representation space, enabling zero-shot closed-loop robotic manipulation control.
Approximate Domain Unlearning for Vision-Language Models: This paper introduces Approximate Domain Unlearning (ADU), a novel task that enables pretrained VLMs to selectively forget recognition capabilities for specified domains (e.g., illustrations, sketches) while preserving classification accuracy on other domains (e.g., real photographs). Two modules are proposed — Domain Disentangling Loss (DDL) and Instance-wise Prompt Generator (InstaPG) — achieving substantial improvements over all baselines across four multi-domain datasets.
AQuaMaM: An Autoregressive, Quaternion Manifold Model for Rapidly Estimating Complex SO(3) Distributions: This paper proposes AQuaMaM—a Transformer-based autoregressive quaternion manifold model that represents each projected component of the unit quaternion as a geometrically constrained mixture of uniform distributions, enabling exact likelihood computation and fast sampling on the SO(3) rotation manifold. AQuaMaM achieves 52× faster inference and 14% higher log-likelihood compared to IPDF, with sampled distributions that closely match the ground truth.
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering: This paper presents DeepTumorVQA, a 3D diagnostic-grade visual question answering benchmark for abdominal CT tumors, comprising 9,262 CT volumes (3.7 million slices) and 395K expert-level questions. It systematically evaluates the clinical diagnostic capability of four state-of-the-art VLMs, finding that current models perform acceptably on measurement tasks but fall far short of clinical requirements in lesion recognition and reasoning.
Attention! Your Vision Language Model Could Be Maliciously Manipulated: This paper proposes the Vision-language Model Manipulation Attack (VMA), an image-based adversarial attack method that combines first- and second-order momentum optimization with a differentiable transformation mechanism, enabling precise control over every output token of a VLM. The approach supports a range of attack scenarios (jailbreaking, hijacking, privacy breach, DoS, sponge examples) and can also be repurposed for copyright-protection watermark injection.
Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization: This paper proposes Balanced Token Pruning (BTP), which jointly considers the impact of pruning on both the current layer (local) and subsequent layers (global). BTP emphasizes diversity preservation in shallow layers to maintain downstream representation quality, and attention-based selection in deep layers to preserve local output consistency. On multiple LVLMs including LLaVA and Qwen2.5-VL, BTP retains 98% of the original model's performance while keeping only 22% of visual tokens.
Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging: This paper proposes BTB3D, a 3D CT tokenizer based on causal convolutional codec, 3D Haar wavelet compression, and a three-stage progressive training strategy. It achieves substantial state-of-the-art improvements on two downstream tasks—radiology report generation and text-conditioned CT synthesis—demonstrating that "better tokens matter more than larger language models."
Beyond Greedy Exits: Improved Early Exit Decisions for Risk Control and Reliability: UAT (Unsupervised Adaptive Thresholding) designs a reliability function for early-exit DNNs to assess the quality of intermediate layer outputs, and employs a multi-armed bandit (MAB) algorithm to dynamically learn optimal exit thresholds at inference time, achieving 1.7–2.1× speedup with less than 2% performance degradation while remaining robust to distribution shift.
BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning: BioCLIP 2 trains a ViT-L on TreeOfLife-200M (214M images across 952K species) using hierarchical contrastive learning, achieving an 18% improvement over BioCLIP in zero-shot species recognition. The work further uncovers emergent properties arising from scale: embeddings automatically encode ecological relationships (e.g., Darwin's finches arranged by beak size), and intra-species variation is orthogonal to inter-species variation.
Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression: This paper proposes UltraDelta — the first data-free delta weight compression pipeline — which achieves compression ratios up to 224× across LLM/NLP/vision/multimodal models without performance degradation and even surpasses fine-tuned models, via three components: variance-guided mixed sparsity allocation, distribution-aware compression, and trace-norm-guided rescaling.
BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models: This paper proposes BridgeVLA, which projects 3D point clouds into multi-view 2D images and uses 2D heatmaps as an intermediate representation to align the input and output spaces, enabling efficient and effective 3D robot manipulation learning.
Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning: This paper proposes In-Context Representation Learning (ICRL), the first training-free framework that injects representations from non-text-modality foundation models (FMs) into a text-only LLM for few-shot reasoning. Two strategies are introduced: PCA-based text-level injection and optimal transport (OT)-based embedding alignment, enabling cross-modal knowledge utilization without any parameter updates.
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?: This paper introduces the Qualcomm Interactive Cooking benchmark and the LiveMamba model, presenting the first systematic evaluation of multimodal LLMs for providing real-time, step-by-step task guidance in streaming video — encompassing instruction delivery, completion detection, and error feedback.
CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness: This paper proposes CAPability, a comprehensive visual captioning benchmark covering 12 dimensions across 6 perspectives. It annotates visual elements (rather than sentences) for nearly 11K images and videos, simultaneously evaluating caption correctness (precision) and thoroughness (hit). A novel "Knows but doesn't Tell" ($K\bar{T}$) metric is introduced to reveal the significant capability gap between MLLMs in QA versus captioning tasks.
Causal-LLaVA: Causal Disentanglement for Mitigating Hallucination in Multimodal Large Language Models: This paper identifies the root cause of object hallucination in MLLMs at the representation level—semantic entanglement induced by dataset co-occurrence bias—and proposes a dual-path causal disentanglement framework (Causal-Driven Projector + Causal Intervention Module). By applying backdoor adjustment at both the projector and the final Transformer layer to decouple co-occurring object representations, the method achieves a 22.6% improvement on MME-Perception.
ChartMuseum: Testing Chart Visual Reasoning in Large Vision-Language Models: This paper introduces ChartMuseum, a chart question-answering benchmark comprising 1,162 expert-annotated questions and real-world charts from 184 distinct sources. It is the first benchmark to systematically distinguish visual reasoning from textual reasoning, revealing that the current strongest model, Gemini-2.5-Pro, achieves only 63.0% accuracy compared to 93% for humans, with visual reasoning performance lagging behind textual reasoning by 35%–55%.
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models: This paper proposes CHOICE, a large-scale multi-level VLM benchmark for the remote sensing domain, comprising 10,507 newly collected questions spanning 2 top-level dimensions, 6 sub-dimensions, and 23 leaf tasks across perception and reasoning, enabling the first systematic and objective evaluation of VLM remote sensing capabilities.
CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization: This paper proposes CoIDO, a bi-objective optimization framework for data selection that jointly optimizes data importance and diversity. By training a lightweight scorer on only 20% of randomly sampled data, CoIDO selects a 20% subset from LLaVA-665K that achieves 98.2% of the performance of full-data fine-tuning, while incurring the lowest computational overhead among all compared methods.
Context Informs Pragmatic Interpretation in Vision-Language Models: This work systematically evaluates the pragmatic reasoning capabilities of VLMs using iterated reference games. Models perform substantially worse than humans in the absence of context, but can rapidly leverage relevant dialogue history to achieve approximately 80% accuracy, revealing a strong dependence on contextual information.
Continual Multimodal Contrastive Learning: This paper is the first to formally define the Continual Multimodal Contrastive Learning (CMCL) problem and proposes Dual-side Null-space gradient projection (DNS), which projects gradients from new data into subspaces that do not interfere with previously acquired knowledge. DNS achieves the best stability–plasticity trade-off across 7 datasets.
CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder: This paper proposes CovMatch, which reduces the bi-level optimization of multimodal contrastive learning to a closed-form cross-covariance matrix alignment problem, enabling for the first time joint optimization of both image and text encoders for multimodal dataset distillation. Using only 500 synthetic image-text pairs, CovMatch achieves a mean retrieval recall of 38.4 on Flickr30K (+6.8% over SOTA LoRS), substantially outperforming frozen-text-encoder approaches in extremely data-efficient settings.
CyIN: Cyclic Informative Latent Space for Bridging Complete and Incomplete Multimodal Learning: This paper proposes the CyIN framework, which constructs an informative latent space via token-level and label-level information bottlenecks (IB), and employs cyclic cross-modal translation to reconstruct missing modality information, simultaneously optimizing complete and incomplete multimodal learning within a single unified model.
DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding: This paper introduces DanmakuTPPBench, the first multi-modal Temporal Point Process (TPP) benchmark integrating temporal, textual, and visual modalities. DanmakuTPP-Events provides 7,250 video sequences with 10.8 million danmaku events collected from Bilibili, with natural three-modal alignment (time, text, video). DanmakuTPP-QA contributes 10 evaluation tasks, with reasoning question-answer pairs generated automatically via a multi-agent pipeline. The benchmark reveals significant deficiencies of both classical TPP models and LLMs/MLLMs in understanding multimodal event dynamics.
Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention: This paper proposes HoloV, a plug-and-play visual token pruning framework that adaptively allocates pruning budgets across different spatial crop regions to preserve global visual context rather than retaining only attention-highlighted tokens. On LLaVA-1.5, HoloV retains 95.8% of original performance after pruning 88.9% of visual tokens.
DOTA: DistributiOnal Test-time Adaptation of Vision-Language Models: DOTA proposes shifting test-time adaptation from a "caching sample instances" paradigm to a "continuously estimating test data distributions" paradigm. By combining online Gaussian discriminant analysis with zero-shot prediction probabilities to estimate per-class distributions, DOTA achieves gradient-free, forgetting-resistant, and efficient test-time adaptation, surpassing all baselines in average accuracy across 10 cross-domain benchmarks.
DynamicVL: Benchmarking MLLMs for Dynamic City Understanding: This paper proposes DVL-Suite, a framework comprising the DVL-Bench evaluation benchmark and the DVL-Instruct instruction-tuning dataset, covering 42 U.S. cities and 14,871 high-resolution multi-temporal remote sensing images. It systematically evaluates 18 MLLMs on long-term urban dynamic understanding and introduces DVLChat as a baseline model.
Efficient Multi-modal Large Language Models via Progressive Consistency Distillation: This paper proposes EPIC, a framework that addresses the optimization difficulty caused by feature space perturbation during visual token compression training via progressive consistency distillation along two dimensions (Token and Layer), achieving efficient multimodal LLMs without modifying model architecture.
ElasticMM: Efficient MLLM Serving with Elastic Multimodal Parallelism: This paper proposes the Elastic Multimodal Parallelism (EMP) paradigm and the ElasticMM system, which disaggregates different stages of multimodal inference into independent instances via modality-aware load balancing and elastic partition scheduling, achieving up to 4.2× TTFT reduction and 3.2–4.5× throughput improvement over vLLM.
READ: Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions: This paper proposes READ, a fine-tuning method that enhances the compositional reasoning capability of CLIP's text encoder via two auxiliary objectives: (1) token-level reconstruction, where a frozen decoder reconstructs alternative descriptions from text embeddings, and (2) sentence-level alignment, which enforces consistency among embeddings of paraphrases. READ achieves state-of-the-art performance on 5 compositional reasoning benchmarks, outperforming NegCLIP by 4.5% and FSC-CLIP by 4.1%.
Enhancing Outcome Reward-Based RL Training of MLLMs with Self-Consistency Sampling: This paper targets the unfaithful reasoning trajectories that outcome-reward RL training induces on multimodal multiple-choice tasks. It proposes Self-Consistency Sampling (SCS), which derives consistency rewards via truncation-resampling and visual perturbation to penalize spurious reasoning. When combined with RLOO, SCS achieves an average improvement of 7.7 percentage points across six benchmarks.
Enhancing Vision-Language Model Reliability with Uncertainty-Guided Dropout Decoding: This paper proposes Dropout Decoding — a training-free inference-time method that projects visual tokens into the text space to quantify their epistemic uncertainty, selectively masks high-uncertainty visual tokens, and aggregates multiple masked decoding results via majority voting to substantially reduce object hallucinations in LVLMs.
Evaluating Multimodal Large Language Models on Core Music Perception Tasks: This paper systematically evaluates multimodal LLMs on three core music perception tasks—syncopation scoring, transposition detection, and chord quality identification—under both audio and MIDI input modalities, revealing that models approach ceiling performance on symbolic reasoning while exhibiting significant deficits in audio perception.
First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training: This paper proposes MM-UPT, a framework that adds a third training stage, "unsupervised post-training," after SFT and RL. By using majority voting as a pseudo-reward signal for GRPO, MM-UPT enables MLLMs to self-improve without labeled data, boosting Qwen2.5-VL-7B from 66.3% to 72.9% on MathVista.
FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models: FlexAC identifies that associative reasoning in MLLMs is primarily encoded in intermediate layers. By extracting steering vectors from hallucinated responses and injecting them into intermediate-layer representations at inference time, it enables flexible control over faithfulness and creativity—reducing hallucination rate by 29% (CHAIR) and improving creativity by 5.8× (Creation-MMBench), all without any training.
FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models: FlowCut reexamines the emergence of visual token redundancy in VLMs through the lens of Information Flow, and proposes a pruning framework featuring layer-adaptive pruning ratios, multi-criteria fusion scoring, and cumulative importance tracking. The approach aligns pruning decisions with the model's intrinsic information propagation behavior. On LLaVA-1.5-7B, FlowCut surpasses the previous SOTA by 1.6% at an 88.9% token reduction rate; on LLaVA-NeXT-7B, it surpasses the previous SOTA by 4.3% at a 94.4% reduction rate.
FlySearch: Exploring how vision-language models explore: FlySearch introduces a photorealistic 3D outdoor environment built on Unreal Engine 5 to evaluate the exploration capabilities of VLMs. Results reveal that state-of-the-art VLMs fail to reliably complete even simple search tasks, and the performance gap relative to humans widens dramatically as task difficulty increases.
FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering: This paper proposes FOCUS, a training-free visual cropping method that constructs object relevance maps via cosine similarity of value features in the MLLM's internal KV-cache, enabling efficient localization of question-relevant image regions. FOCUS achieves accuracy comparable to state-of-the-art methods on fine-grained VQA benchmarks while improving computational efficiency by 3–6.5×.
ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation: This paper proposes ForceVLA, which introduces 6-axis force/torque sensing as a first-class modality within the VLA framework. A Force-aware Vision-Language Mixture-of-Experts (FVLMoE) module dynamically fuses visual-language embeddings with real-time force feedback at the action decoding stage, achieving an average success rate improvement of 23.2% across five contact-rich manipulation tasks, with individual tasks reaching up to 80%.
GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images: GEM is proposed as the first multimodal large language model that unifies ECG time series, 12-lead ECG images, and text. Through a dual-encoder framework, cross-modal alignment, and knowledge-guided instruction data generation, GEM achieves grounded ECG diagnosis based on quantifiable physiological features, improving diagnostic accuracy by 7.4%, interpretability by 22.7%, and grounding capability by 25.3%.
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling: This paper proposes REVERSE, the first framework to unify generation adjustment and post-hoc verification within a single VLM. Through hallucination-aware training on 1.3M semi-synthetic samples combined with inference-time retrospective resampling, REVERSE enables a VLM to automatically detect and correct hallucinations during generation, achieving a 12% reduction on CHAIR-MSCOCO and a 34% improvement on HaloQuest.
GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization: This paper proposes GeoRanker, a distance-aware ranking framework that leverages large vision-language models (LVLMs) to model spatial relationships between queries and candidates, achieving state-of-the-art worldwide image geolocalization via a multi-order distance loss.
GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity: GLSim is a training-free object hallucination detection method for LVLMs that combines a global scene similarity score (cosine similarity between the object token and the last instruction token) and a local visual grounding similarity score (cosine similarity between the object token and the Top-K image patch embeddings localized via Visual Logit Lens). It achieves 83.7% AUROC on MSCOCO, surpassing SVAR by 9% and Internal Confidence by 10.8%.
GoalLadder: Incremental Goal Discovery with Vision-Language Models: This paper proposes GoalLadder, a framework that leverages VLMs to incrementally discover and rank candidate goal states, employs an ELO rating system to handle noisy feedback, and defines distance-based rewards in a learned embedding space. Using only a single language instruction, the method trains RL agents to achieve approximately 95% success rate.
Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment: This paper proposes MAPLE, a framework that leverages the inherent modality alignment capabilities of off-the-shelf MLLMs to automatically construct preference data, and introduces a Relative Preference Alignment (RPA) loss to guide cross-modal representation learning, achieving significant improvements on fine-grained retrieval tasks.
HAWAII: Hierarchical Visual Knowledge Transfer for Efficient VLM: This paper proposes the Hawaii framework, which distills knowledge from multiple visual experts into a single visual encoder via Mixture of LoRA Adapters (MoLA) and Hierarchical Knowledge Distillation (HKD), significantly improving the visual understanding capability of VLMs without incurring any additional inference cost.
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation: This work is the first to identify the systematic phenomenon that understanding capability consistently surpasses generation capability in unified multimodal large language models. It proposes the HermesFlow framework, which constructs paired understanding-generation preference data from homologous inputs, and employs Pair-DPO with iterative self-play optimization to simultaneously improve both capabilities and narrow the gap between them—without relying on any external high-quality data.
Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems: This paper derives a Hierarchical Self-Attention (HSA) mechanism from the first principle of entropy minimization, providing a theoretically optimal attention computation method for nested signals (multimodal and multi-scale data). It further proves that, among attention mechanisms satisfying the hierarchical block constraints, HSA is the one closest to standard Softmax attention in KL divergence.
HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models: This work presents the first theoretical analysis of frequency allocation strategies in multimodal RoPE for long-context VLMs. It proposes HoPE, which sets the lowest frequency to zero for temporal modeling to guarantee the semantic preference property, coupled with a dynamic temporal scaling mechanism, achieving gains of 8.35% on long video understanding and 22.23% on retrieval tasks.
iFinder: Structured Zero-Shot VLM Grounding for Dash-Cam Video Reasoning: This paper proposes iFinder, a modular training-free framework that decouples dash-cam video understanding into perception (structured scene representation) and reasoning (LLM). Through a hierarchical data structure and a three-block prompting strategy, iFinder endows LLMs with interpretable spatiotemporal reasoning capabilities, achieving zero-shot superiority over end-to-end V-VLMs across four driving video benchmarks, with accident reasoning accuracy gains of up to 39%.
In-Context Compositional Learning via Sparse Coding Transformer: Inspired by sparse coding, this work reinterprets the Transformer attention mechanism as projection onto encoding and decoding dictionaries, explicitly represents compositional rules via sparse coefficients, and transfers compositional rules from in-context tasks to target tasks using a lifting scheme.
In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-: This paper proposes the EgoGazeVQA benchmark and three gaze-guided prompting strategies (textual / visual / salience map), providing the first systematic validation of eye-gaze signals for improving egocentric video intent understanding in MLLMs. The best configuration, Qwen2.5-VL-72B + GazeS, achieves a 5.8 percentage-point gain in average accuracy.
Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats: This paper proposes AllPath, a multi-path hallucination intervention framework grounded in the Transformer causal architecture. It is the first to demonstrate that hallucinations in LVLMs do not stem from a single causal path but from the interaction of three paths — image-to-input-text, image-to-output-text, and text-to-text — and that models adaptively rely on different paths depending on the question-answer alignment format. By designing lightweight key-head identification methods for each path and performing adaptive intervention, AllPath consistently reduces hallucinations across four benchmarks covering different alignment formats: POPE, MCQ-POPE, CHAIR, and MME.
JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models: Inspired by the Eliciting Latent Knowledge (ELK) framework, this paper is the first to reveal that VLMs possess approximable safety decision boundaries in the latent space of fusion layers. It proposes JailBound, a two-stage attack framework comprising Safety Boundary Probing and Safety Boundary Crossing, which jointly optimizes image and text adversarial perturbations to cross this boundary. JailBound achieves average attack success rates of 94.32% and 67.28% in white-box and black-box settings, respectively, significantly surpassing the state of the art.
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors: VG-LLM proposes integrating a 3D visual geometry encoder (VGGT) into multimodal large language models, enabling the extraction and fusion of 3D geometric priors from video input alone—without any explicit 3D data. This approach significantly improves MLLM performance on 3D scene understanding and spatial reasoning tasks, with the 4B model surpassing Gemini-1.5-Pro on VSI-Bench.
Learning Shared Representations from Unpaired Data: This paper proposes SUE (Spectral Universal Embedding), which is the first to demonstrate that cross-modal shared representations can be learned with almost entirely unpaired data. Independent spectral embeddings extract modality-invariant "universal" structure from random walks within each modality; a minimal number of paired samples (~100 pairs) then enables CCA-based linear alignment followed by MMD-based nonlinear fine-tuning. SUE outperforms contrastive learning using the same number of pairs by more than 250% on retrieval benchmarks.
Learning Skill-Attributes for Transferable Assessment in Video: This paper proposes CrossTrainer, a method that discovers sport-agnostic skill attributes (e.g., balance, control, hand positioning) as intermediate representations to train a multimodal language model for generating actionable feedback and proficiency assessments from video. CrossTrainer achieves up to 60% relative improvement over the state of the art in zero-shot cross-sport transfer.
Learning to Instruct for Visual Instruction Tuning: This paper proposes L2T (Learning to Instruct), which improves visual instruction tuning solely by extending the training loss to cover the instruction sequence (rather than computing loss on responses only). Without additional data and with virtually zero computational overhead, L2T achieves up to 9% relative improvement across 16 multimodal benchmarks, an 18% gain on captioning tasks, and notable hallucination reduction.
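The change L2T describes, counting the loss on instruction tokens as well as response tokens, can be illustrated as a change of loss mask. The following is a minimal NumPy sketch under stated assumptions: the helper `masked_nll`, the toy token layout, and the random logits are all hypothetical and are not taken from the paper's code.

```python
import numpy as np

def masked_nll(log_probs, targets, mask):
    """Mean negative log-likelihood over positions where mask is True.

    log_probs: (T, V) log-probabilities per position
    targets:   (T,) target token ids
    mask:      (T,) boolean, True where the loss is counted
    """
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return float(token_nll[mask].sum() / mask.sum())

# Toy sequence (assumed layout): 3 instruction tokens, then 2 response tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 7))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
targets = np.array([1, 4, 2, 6, 0])
is_instruction = np.array([True, True, True, False, False])

# Conventional setup: only response tokens contribute to the loss.
response_only_loss = masked_nll(log_probs, targets, ~is_instruction)
# Extended setup: instruction and response tokens both contribute.
extended_loss = masked_nll(log_probs, targets, np.ones(5, dtype=bool))
```

Because only the mask changes, the forward and backward passes are the same size in both setups, which is consistent with the summary's claim of virtually zero overhead.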
Learning to Steer: Input-dependent Steering for Multimodal LLMs: Addressing the limitation of existing steering methods that rely on fixed direction vectors incapable of adapting to diverse inputs, this paper proposes L2S (Learn-to-Steer): it first generates ideal input-specific steering vectors via contrastive prompting (P2S), then trains a lightweight 2-layer MLP to predict these vectors from the input context. This achieves input-dependent behavioral steering at negligible overhead, significantly outperforming static steering baselines on both safety enforcement and hallucination mitigation.
MDReID: Modality-Decoupled Learning for Any-to-Any Multi-Modal Object Re-Identification: This paper proposes MDReID, a framework that decouples modality features into modality-shared and modality-specific components, enabling object re-identification under arbitrary modality combinations (any-to-any ReID) and substantially outperforming existing methods in both modality-matched and modality-mismatched scenarios.
Metacognitive Sensitivity for Test-Time Dynamic Model Selection: Inspired by the concept of metacognitive sensitivity (meta-d') from cognitive science, this paper proposes a test-time dynamic model selection framework that quantifies a model's ability to "know what it doesn't know" via meta-d', combines it with instantaneous confidence scores to form a context vector, and employs a contextual bandit to online-select the optimal model, outperforming individual models across multiple datasets.
MIDAS: Misalignment-based Data Augmentation Strategy for Imbalanced Multimodal Learning: This work is the first to propose using cross-modal misaligned samples as supervised training signals—rather than treating them as noise or interference—to alleviate modality imbalance in multimodal learning. The proposed MIDAS data augmentation framework combines three complementary mechanisms: confidence-based labeling of misaligned samples, weak-modality weighting, and hard-sample weighting. MIDAS substantially outperforms existing methods across four multimodal classification benchmarks.
Mint: A Simple Test-Time Adaptation of Vision-Language Models against Common Corruptions: This work identifies embedding variance collapse—the simultaneous shrinkage of intra- and inter-class variance that erodes discriminability in the embedding space—as the root cause of CLIP's performance degradation under image corruptions. It proposes Mint, which restores embedding geometry online by maximizing pseudo-label inter-class variance (PL-inter) using only two lightweight components: a mean accumulator and a gradient accumulator. Mint consistently improves CLIP's classification accuracy across multiple corruption benchmarks even at BS=1, while running 45× faster than the strongest baseline.
MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agriculture: MIRAGE is the first multimodal benchmark constructed from real agricultural expert consultation dialogues (35,000+), evaluating vision-language models on domain-level entity identification, causal reasoning, and clarify-or-respond decision-making. It reveals a severe challenge in which even GPT-4.1 achieves only 43.9% identification accuracy.
MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models: This paper proposes MM-OPERA, an open-ended association reasoning benchmark comprising 11,497 instances. It evaluates the association reasoning capabilities of LVLMs through two tasks — Remote-Item Association (RIA) and In-Context Association (ICA) — and introduces an LLM-as-a-Judge scoring strategy alongside a process reward evaluation method. The benchmark reveals that even the strongest current LVLMs remain significantly behind humans.
MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios: This paper introduces MME-VideoOCR, a comprehensive video OCR evaluation benchmark comprising 25 tasks, 44 scenarios, 1,464 videos, and 2,000 manually annotated QA pairs, spanning three levels of text recognition, understanding, and reasoning. Evaluation of 18 state-of-the-art MLLMs reveals that the strongest model (Gemini-2.5 Pro) achieves only 73.7% overall, with cross-frame understanding tasks falling below 25%.
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly: This paper introduces MMLongBench, the first comprehensive benchmark for evaluating long-context vision-language models (LCVLMs), comprising 13,331 samples spanning 5 downstream task categories, mixed image types, and 5 standardized input length levels (8K–128K tokens). Evaluation of 46 models reveals that single-task performance is a weak proxy for overall capability, and that stronger reasoning ability positively correlates with long-context performance.
MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness: The first benchmark to systematically evaluate the perspective understanding capabilities of multimodal large language models (MLLMs), comprising 10 tasks across 3 dimensions, 2,711 images, and 5,083 question–answer pairs. It reveals significant deficiencies in perspective reasoning and robustness across 43 state-of-the-art models.
MoniTor: Exploiting Large Language Models with Instruction for Online Video Anomaly Detection: This paper proposes MoniTor, a memory-based online scoring queue framework that leverages LLMs for training-free online video anomaly detection (VAD). It guides LLMs toward real-time anomaly recognition through a dual-layer memory mechanism, behavior prediction, and a standard scoring queue.
Multi-Modal Masked Autoencoders for Learning Image-Spectrum Associations for Galaxy Evolution and Cosmology: This work constructs GalaxiesML-Spectra, a large-scale multi-modal dataset of 134,533 galaxies with images, spectra, and redshifts, and adapts a Multi-Modal Masked Autoencoder (MMAE) for joint image–spectrum reconstruction and redshift regression. It demonstrates that even when spectra are entirely absent at test time, 25%-masked images alone yield a redshift prediction scatter of $\sigma_{NMAD} = 0.016$, surpassing AstroCLIP.
Multimodal Bandits: Regret Lower Bounds and Optimal Algorithms: For the multimodal multi-armed bandit problem where the reward function has at most $m$ modes, this paper proposes the first computationally feasible algorithm for solving the Graves-Lai optimization problem, achieves an asymptotically optimal regret bound, and proves that local search strategies are suboptimal.
Multimodal Negative Learning: This paper proposes the Multimodal Negative Learning (MNL) paradigm, in which dominant modalities guide weaker modalities to suppress non-target classes—rather than enforcing alignment on target classes—thereby stabilizing the decision space, preserving modality-specific information, and theoretically tightening the robustness lower bound of multimodal fusion.
Nautilus: A Large Multimodal Model for Underwater Scene Understanding: This paper presents Nautilus, the first large multimodal model supporting eight underwater scene understanding tasks. It introduces a physics-prior-driven Visual Feature Enhancement (VFE) module that explicitly rectifies underwater image degradation in feature space, improving the robustness of LMMs in underwater environments.
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints: This paper systematically investigates the design space and scaling properties of native multimodal large language models (Native MLLMs) under data constraints. It identifies a positive log-linear optimal scaling relationship between the visual encoder and the LLM, and based on this finding proposes NaViL, which achieves competitive performance with state-of-the-art MLLMs using only approximately 600 million pre-training image-text pairs.
NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables: This paper proposes NeedleInATable (NIAT), a benchmark that treats each table cell as a "needle" to evaluate the fine-grained perception capability of LLMs over long structured tables. It reveals that strong performance of existing models on complex downstream tasks may stem from dataset shortcuts rather than genuine table understanding.
NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception: This paper proposes the NegoCollab framework, which introduces a Negotiator module to negotiate a common representation from the local representations of heterogeneous multimodal agents during training, effectively eliminating domain gaps between heterogeneous collaborative agents and enabling low-cost collaborative connected perception.
Omni-Mol: Multitask Molecular Model for Any-to-Any Modalities: This paper proposes Omni-Mol, a unified molecular understanding and generation framework built upon a multimodal LLM. Through a 1.42M-sample instruction tuning dataset, Gradient Adaptive LoRA (GAL), and a Mixture-of-GAL-Experts (MoGE) architecture, Omni-Mol is the first single model to jointly learn 16 molecular tasks (Mol2Mol / Mol2Text / Mol2Num / Text2Mol), achieving SOTA on 13 tasks with only 2.2B parameters.
On the Value of Cross-Modal Misalignment in Multimodal Representation Learning: This paper proposes a latent variable model that formalizes cross-modal misalignment into two mechanisms—selection bias and perturbation bias—and theoretically proves that MMCL-learned representations precisely capture the invariant semantic subset unaffected by both biases, thereby unifying the opposing views of misalignment as harmful vs. beneficial.
OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Models: This paper proposes OpenHOI, a framework that leverages the commonsense reasoning capabilities of multimodal large language models (MLLMs) to infer contact regions and grasp types for unseen objects, enabling open-world hand-object interaction synthesis without requiring per-object training data collection.
PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments: This paper proposes the Active Visual Reasoning (AVR) task paradigm, constructs the CLEVR-AVR simulation benchmark and the AVR-152k dataset (with rich CoT annotations), and trains the PhysVLM-AVR model to iteratively acquire information through a perception–reasoning–action closed loop in partially observable interactive environments, significantly outperforming existing MLLMs.
Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning: This paper discovers that the decision-making reasoning capability of VLMs can be decoupled from visual perception—replacing image inputs with textual descriptions yields equal or higher decision accuracy. Building on this insight, Praxis-VLM trains decision-making reasoning on purely textual scenarios via multi-stage GRPO with adaptive rewards, then transfers zero-shot to visual inputs at inference time, achieving comprehensive improvements over SFT baselines on three decision-making benchmarks, with especially notable gains in OOD generalization.
PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation: PrefixKV identifies that the importance distributions of KV caches vary substantially across layers, and formalizes the per-layer cache sizing problem as a global prefix configuration search. A binary search is employed to find the optimal cumulative priority threshold that maximizes contextual information retention in each layer. At a 20% retention ratio, PrefixKV incurs only a 0.49 PPL degradation while delivering a 1.8× inference speedup.
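The idea of one global threshold that induces a different cache size in each layer can be sketched as follows. This is an illustrative reading only: the function names, the toy importance scores, and the rule "keep the shortest priority-sorted prefix whose normalized cumulative importance reaches the threshold" are assumptions, not the paper's exact formulation.

```python
import numpy as np

def prefix_sizes(importance, threshold):
    """Per layer, length of the shortest priority-sorted prefix whose
    normalized cumulative importance reaches `threshold`."""
    sizes = []
    for scores in importance:
        ranked = np.sort(np.asarray(scores, dtype=float))[::-1]
        cumulative = np.cumsum(ranked) / ranked.sum()
        # First index whose cumulative share is >= threshold.
        first_hit = int(np.searchsorted(cumulative, threshold))
        # Every layer keeps at least one entry and at most all of them.
        sizes.append(min(first_hit + 1, len(ranked)))
    return sizes

def search_threshold(importance, budget, iters=50):
    """Binary-search the largest threshold whose total size fits the budget."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sum(prefix_sizes(importance, mid)) <= budget:
            lo = mid   # feasible: try retaining more
        else:
            hi = mid   # over budget: retain less
    return lo

# Two toy layers: one with peaked importance, one with flat importance.
importance = [
    [10, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1] * 10,
]
budget = 4  # 20% of the 20 cached entries
threshold = search_threshold(importance, budget)
sizes = prefix_sizes(importance, threshold)
```

In this toy case the peaked layer needs fewer entries than the flat layer to reach the same cumulative share, so the shared threshold allocates the budget unevenly across layers.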
Reading Recognition in the Wild: This paper introduces a novel reading recognition task and the first large-scale multimodal "reading-in-the-wild" dataset (100 hours). A lightweight Transformer model fusing three complementary modalities—RGB, gaze, and IMU—enables real-time reading detection on smart glasses.
Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models: This paper proposes GLOBE, an LVLM-based image geo-localization system trained via GRPO reinforcement learning. It constructs MP16-Reason, a reasoning-oriented dataset with localizability assessment, visual-clue reasoning chains, and geographic accuracy annotations. Using only 33K training examples, GLOBE surpasses both SOTA methods trained on millions of samples and large-scale open-source VLMs across multiple benchmarks.
Rethinking Multimodal Learning from the Perspective of Mitigating Classification Ability Disproportion: This paper proposes a classification ability disproportion perspective to understand modality imbalance in multimodal learning, and designs a Sustained Boosting algorithm (shared encoder + multiple configurable classifiers, jointly optimizing classification and residual errors) coupled with Adaptive Classifier Assignment (ACA). The paper theoretically proves that the cross-modal gap loss converges at $\mathcal{O}(1/T)$, and achieves substantial improvements over SOTA on six datasets including CREMAD.
Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval: This paper proposes Retrv-R1, the first R1-style reasoning-based multimodal retrieval framework. It reduces token consumption via an Information Compression Module (ICM), preserves complete information for hard candidates through a Details Inspection Mechanism (DIM), and employs a curriculum-based RL reward to balance effectiveness and efficiency, achieving state-of-the-art performance on universal multimodal retrieval benchmarks.
Revisiting Logit Distributions for Reliable Out-of-Distribution Detection: This paper proposes LogitGap, a novel post-hoc OOD detection scoring function that explicitly exploits the "gap" between the maximum logit and the remaining logits to distinguish in-distribution (ID) from out-of-distribution (OOD) samples. A top-N selection strategy is introduced to filter noisy logits. Theoretical analysis and experiments demonstrate that LogitGap outperforms MCM and MaxLogit across multiple scenarios.
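A gap-based score with top-N filtering might look like the sketch below. The exact LogitGap formula is not given in this note, so the definition used here (maximum logit minus the mean of the other retained logits) and the function name `logit_gap_score` are assumptions for illustration.

```python
import numpy as np

def logit_gap_score(logits, top_n=None):
    """Gap between the largest logit and the mean of the other retained logits.

    If top_n is given, only the top_n largest logits are retained, which
    discards small tail logits that would otherwise distort the gap.
    Requires at least two retained logits.
    """
    ranked = np.sort(np.asarray(logits, dtype=float))[::-1]
    if top_n is not None:
        ranked = ranked[:top_n]
    return float(ranked[0] - ranked[1:].mean())

# Hypothetical logits: a confident (ID-like) sample and a flat (OOD-like) one
# whose single very negative tail logit inflates the unfiltered gap.
confident = [9.0, 1.0, 0.5, 0.0, -1.0]
flat = [2.1, 2.0, 1.9, 1.8, -6.0]

score_confident = logit_gap_score(confident, top_n=3)
score_flat_top3 = logit_gap_score(flat, top_n=3)
score_flat_all = logit_gap_score(flat)
```

A higher score is read as more in-distribution. Restricting to the top 3 logits keeps the flat sample's score small, whereas the unfiltered version is pulled up by the tail logit.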
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics: This paper proposes RoboRefer, a 3D-aware reasoning VLM trained via a two-stage SFT + RFT strategy with a metric-sensitive process reward function. It achieves precise single-step spatial understanding and multi-step spatial reasoning on spatial referring tasks, surpassing Gemini-2.5-Pro by 17.4% on RefSpatial-Bench.
RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness: From the perspective of low-rank decomposition, this paper identifies "direction robustness" as the key factor in parameter-efficient module merging (as opposed to sign conflicts in full-parameter merging), and proposes RobustMerge, which maintains singular value direction stability via complementary parameter adaptive scaling and cross-task normalization, achieving average improvements of 3.4% (seen tasks) and 4.5% (unseen tasks) on multimodal generation benchmarks.
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video: This paper proposes RTV-Bench, a benchmark comprising 552 videos and 4,608 QA pairs, designed to systematically evaluate MLLMs' continuous analysis capabilities in real-time video streams through three core designs: multi-timestamp QA (the same question yields different correct answers at different timestamps), hierarchical question structure, and multidimensional evaluation. Key findings include that online models outperform offline models, and that simply scaling model size or increasing frame count yields limited gains.
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video: This paper proposes RTV-Bench, a fine-grained evaluation benchmark for assessing the continuous real-time video analysis capabilities of MLLMs. Comprising 552 videos and 4,608 QA pairs, it comprehensively evaluates model perception, understanding, and reasoning in dynamic video streams through a multi-timestamp QA mechanism, hierarchical question structure, and multi-dimensional assessment.
Scene-Aware Urban Design: A Human-AI Recommendation Framework Using Co-Occurrence Embeddings and Vision-Language Models: This paper proposes a human-AI collaborative computer vision framework that employs Grounding DINO for urban object detection, constructs co-occurrence embeddings from the ADE20K dataset to capture real-world spatial configurations, leverages a VLM for scene-aware third-object recommendation, and generates 3D models for AR preview — all aimed at enabling residents to participate in micro-scale urban design.
SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs: This paper proposes SCOPE, a visual token pruning strategy that jointly models saliency and coverage. By iteratively selecting tokens with the highest SCOPE scores, it preserves semantic completeness and retains 96% of LLaVA-1.5's performance under a 9× token reduction.
SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models: This paper proposes MSMU, a large-scale quantitative spatial reasoning dataset (700K QA pairs, 2.5M numerical annotations), and Depth Positional Encoding (DPE), enabling VLMs to achieve strong quantitative spatial measurement and understanding without relying on 3D point clouds. SD-VLM outperforms GPT-4o by 26.91% on MSMU-Bench.
Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models: This paper addresses OCR hallucinations in MLLMs under degraded document conditions. It introduces KIE-HVQA, the first benchmark for evaluating hallucinations in degraded document scenarios, and proposes a multi-objective reward reinforcement learning framework based on GRPO. The resulting 7B-parameter model achieves approximately 28% higher hallucination-suppression accuracy than GPT-4o.
See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model: This paper proposes See&Trek, a training-free and GPU-free spatial prompting framework that enhances spatial understanding in MLLMs through maximum semantic richness sampling and motion reconstruction, achieving up to 3.5% improvement on VSI-Bench.
Sherlock: Self-Correcting Reasoning in Vision-Language Models: The first systematic study of self-correction capabilities in reasoning VLMs: existing reasoning VLMs are found to be nearly incapable of self-correction (<10% exhibit an aha moment). The paper proposes Sherlock, a three-stage training framework (SFT cold-start → offline trajectory-level preference learning → online self-iterative improvement) that surpasses LLaVA-CoT/Mulberry/LlamaV-o1 (which use 100K–260K annotations) using only 20K labeled samples.
SITCOM: Scaling Inference-Time COMpute for VLAs: SITCOM proposes an inference-time compute scaling framework inspired by Model Predictive Control (MPC). It performs multi-step rollout simulation of a pretrained VLA using a learned dynamics model and selects optimal trajectories via a reward model, transforming a single-step VLA into a robust long-horizon planner. On the SIMPLER benchmark, it improves task success rate from 48% to 72%.
Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Models: This work introduces the Situat3DChange dataset (174K data instances) that unifies dynamic scene change perception and situated awareness understanding under a perception–action paradigm, and proposes SCReasoner—an efficient 3D MLLM for point cloud comparative reasoning.
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models: This paper extends sparse autoencoders (SAEs) to vision-language models (e.g., CLIP), proposes the MonoSemanticity score (MS) to quantitatively evaluate the monosemanticity of neurons, and demonstrates that manipulating SAE neurons can directly steer multimodal large language models (e.g., LLaVA) to insert or suppress specific concepts.
SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards: This paper proposes SpatialThinker, which trains MLLMs to construct scene graphs and perform structured spatial reasoning via online RL with multi-objective dense spatial rewards (lexicographic gating over format → count → accuracy → spatial localization). Using only 7K samples, it surpasses GPT-4o on 3DSRBench by 12.1%.
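Lexicographic gating over an ordered list of reward terms can be sketched as below: a later term is only credited once the earlier ones pass. The weights, the gate conditions, and the function name `gated_reward` are hypothetical choices made for illustration and are not the values used by SpatialThinker.

```python
def gated_reward(format_ok, count_score, accuracy, spatial_score,
                 weights=(0.1, 0.2, 0.5, 0.2), count_gate=0.5):
    """Combine four reward terms in the order format -> count -> accuracy -> spatial.

    Each stage is only credited if the previous stage passed its gate
    (all gates and weights here are illustrative assumptions).
    """
    w_format, w_count, w_acc, w_spatial = weights
    if not format_ok:
        return 0.0                      # malformed output earns nothing
    reward = w_format + w_count * count_score
    if count_score < count_gate:
        return reward                   # counts too far off: stop here
    reward += w_acc * accuracy
    if accuracy < 1.0:
        return reward                   # wrong answer: no localization credit
    return reward + w_spatial * spatial_score

malformed = gated_reward(False, 1.0, 1.0, 1.0)
wrong_answer = gated_reward(True, 1.0, 0.0, 1.0)
fully_correct = gated_reward(True, 1.0, 1.0, 1.0)
```

The point of the gating is that good localization cannot compensate for a malformed or incorrect answer, which discourages reward hacking on the easier terms.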
SpatialTraceGen: High-Fidelity Traces for Efficient VLM Spatial Reasoning Distillation: This paper proposes SpatialTraceGen, a framework that distills high-quality multi-step tool-use reasoning traces from large teacher models via automated verification, enabling efficient fine-tuning of small VLMs for spatial reasoning.
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning: This paper proposes SRPO (Self-Reflection enhanced reasoning with Group Relative Policy Optimization), a two-stage reflection-aware RL framework. Stage 1 constructs reflection data via large model distillation for SFT cold-start; Stage 2 designs a reflection-aware reward function within GRPO to reinforce concise and effective self-reflection. SRPO achieves state-of-the-art results at the 7B/32B scale on multimodal reasoning benchmarks including MathVista, MathVision, and MMMU-Pro.
SSR: Enhancing Depth Perception in VLMs via Rationale-Guided Spatial Reasoning: This paper proposes the SSR framework, which converts raw depth information into structured textual reasoning rationales and compresses them into compact latent embeddings via knowledge distillation, enhancing the spatial reasoning capabilities of existing VLMs in a plug-and-play manner.
Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs: This paper proposes Struct2D, a perception-guided prompting framework that converts 3D perception outputs into structured 2D representations (BEV images + object labels + metadata), enabling MLLMs to perform complex spatial reasoning without explicit 3D input. The authors also construct Struct2D-Set, a large-scale instruction tuning dataset containing 200K QA pairs.
Structure-Aware Fusion with Progressive Injection for Multimodal Molecular Representation Learning: This paper proposes MuMo, a framework that fuses 2D topological and 3D geometric information into stable structural priors via a Structured Fusion Pipeline (SFP), and asymmetrically integrates these priors into the sequence stream through a Progressive Injection (PI) mechanism, achieving an average improvement of 2.7% over competitive baselines across 29 molecular property prediction benchmarks and ranking first on 22 of them.
Systematic Reward Gap Optimization for Mitigating VLM Hallucinations: This paper proposes Topic-level Preference Rewriting (TPR), which systematically optimizes the reward gap configuration in preference data through fine-grained semantic control at the topic level, combined with a curriculum learning strategy that progressively increases the difficulty of negative samples, achieving approximately 93% hallucination reduction across multiple hallucination benchmarks.
T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with VLMs: This paper proposes the T-Rex framework, which dynamically selects the optimal spatial representation extraction scheme (point / vector / 6D pose) according to task complexity, and introduces Chain of Grounding (CoG) to guide VLMs through step-by-step reasoning, enabling training-free open-vocabulary robotic manipulation.
Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models: This paper proposes STS (Spectrum-Aware Test-Time Steering), a lightweight test-time adaptation method that extracts a low-dimensional semantic subspace via SVD decomposition of text embeddings, and learns a small set of coefficients to steer text prototypes within this subspace to handle distribution shift. STS requires no backpropagation through large encoders, runs 8× faster than TPT with 12× less memory, and substantially outperforms existing TTA methods on OOD datasets.
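The parameterization described here, a low-dimensional basis from an SVD of text embeddings plus a few steering coefficients, can be sketched in NumPy. The centering step, the shared shift applied to all prototypes, and the names `spectral_basis` and `steer` are assumptions; how the coefficients are learned at test time is not shown.

```python
import numpy as np

def spectral_basis(text_embeds, k):
    """Top-k right singular vectors of the centered text embeddings, shape (k, d)."""
    centered = text_embeds - text_embeds.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def steer(text_embeds, basis, coeffs):
    """Shift every prototype by a vector inside the subspace, then re-normalize."""
    shifted = text_embeds + coeffs @ basis   # (k,) @ (k, d) -> (d,), broadcast over classes
    return shifted / np.linalg.norm(shifted, axis=1, keepdims=True)

# Hypothetical setup: 10 class prototypes with 16-dimensional embeddings.
rng = np.random.default_rng(0)
text_embeds = rng.normal(size=(10, 16))
text_embeds /= np.linalg.norm(text_embeds, axis=1, keepdims=True)

basis = spectral_basis(text_embeds, k=3)
coeffs = np.array([0.5, -0.2, 0.1])          # the only adaptable parameters
steered = steer(text_embeds, basis, coeffs)
```

Only `coeffs` (here three numbers) would be adapted per test input, which is why such a scheme needs no gradients through the image or text encoders.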
Text to Robotic Assembly of Multi Component Objects using 3D Generative AI and Vision Language Models: This paper proposes an end-to-end pipeline that converts natural language input into 3D mesh models via 3D generative AI, then leverages zero-shot multimodal reasoning of VLMs to automatically decompose the mesh into multi-component 3D models (structural components + panel components), which are subsequently assembled into physical objects by a robotic arm. The system also supports interactive user feedback through dialogue to adjust component assignments.
The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models: This paper introduces TTA-VLM, a unified benchmark evaluating 8 episodic and 7 online test-time adaptation (TTA) methods across 15 datasets under controlled experimental conditions. Three surprising findings emerge: (1) existing TTA methods offer only marginal improvements over the early TPT baseline; (2) TTA methods collaborate poorly with training-time fine-tuning approaches; (3) accuracy gains come at the cost of calibration, OOD detection, and robustness.
To Think or Not To Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning: This paper systematically investigates whether explicit thinking is necessary in rule-based reinforcement fine-tuning (RFT). It finds that on visual perception tasks, No-Thinking-RFT consistently outperforms the conventional think-then-answer paradigm, and proposes an Adaptive-Thinking approach that allows models to autonomously determine whether to reason based on their own capability and task complexity.
To See or To Read: User Behavior Reasoning in Multimodal LLMs: This paper proposes BehaviorLens, a benchmarking framework that systematically compares three representations of user behavior history — text sequences, scatter plots, and flowcharts — for next-purchase prediction with MLLMs. Visual representations are shown to improve prediction accuracy by up to 87.5% over equivalent text representations without incurring additional computational overhead.
TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning: This paper proposes TOMCAT, which dynamically updates compositional prototypes by accumulating dual-modality (textual and visual) knowledge from unlabeled test data at test time, addressing label distribution shift and achieving state-of-the-art performance on four CZSL benchmarks.
Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs: This paper proposes E3VQA, the first multi-view VQA benchmark, and M3CoT, a prompting technique that fuses three complementary scene graphs, to enhance multi-view scene understanding in Large Vision-Language Models (LVLMs), achieving gains of 4.84% on GPT-4o and 5.94% on Gemini 2.0 Flash.
Towards Evaluating Proactive Risk Awareness of Multimodal Language Models: This paper introduces PaSBench, a benchmark for evaluating the proactive risk awareness of multimodal language models, in which models must autonomously observe environments and issue safety warnings without any user query. An evaluation of 36 models reveals that the strongest model (Gemini-2.5-Pro) achieves only 71% accuracy, and that 45% of risks are not detected consistently. The core bottleneck is identified as unstable proactive reasoning rather than a lack of safety knowledge.
Training-free Online Video Step Grounding: This paper proposes BaGLM, a training-free online video step grounding method that integrates LLM-estimated step dependencies and LMM-estimated step progress into zero-shot LMM predictions via Bayesian filtering, outperforming existing trained offline methods on three datasets.
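A single predict-and-update step of a discrete Bayes filter over procedure steps is sketched below. Here `transition` stands in for the LLM-estimated step dependencies and `likelihood` for the per-frame LMM prediction; both matrices and all numbers are hypothetical, and the sketch is a generic filter rather than BaGLM's exact formulation.

```python
import numpy as np

def bayes_filter_step(belief, transition, likelihood):
    """One predict/update step of a discrete Bayes filter.

    belief:     (S,) current probability of being in each step
    transition: (S, S) with transition[i, j] = P(next = j | current = i)
    likelihood: (S,) observation score for each step at this frame
    """
    predicted = transition.T @ belief     # predict: propagate through the dependencies
    posterior = predicted * likelihood    # update: weight by the current observation
    return posterior / posterior.sum()

# Three steps that must happen in order: stay, or advance to the next step.
transition = np.array([
    [0.7, 0.3, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
])
belief = np.array([1.0, 0.0, 0.0])        # the video starts in step 1
likelihood = np.array([0.3, 0.3, 0.4])    # noisy prediction slightly favoring step 3

posterior = bayes_filter_step(belief, transition, likelihood)
```

Although the raw prediction slightly favors step 3, the filter assigns it zero probability because step 3 cannot directly follow step 1 under the assumed dependencies.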
TRoVe: Discovering Error-Inducing Static Feature Biases in Temporal Vision-Language Models: TRoVe proposes an automated method for discovering static feature biases that induce systematic prediction errors in temporal VLMs. Through a dual-scoring mechanism combining an Error Contribution Score (ECS) and a Static Bias Score (SBS), TRoVe outperforms baselines by 28.6% on 101 synthetic models and successfully identifies novel biases in 7 real-world VLMs.
Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition: This paper proposes Uni-MuMER, which performs unified multi-task fine-tuning of an open-source VLM via three data-driven tasks (Tree-CoT, Error-Driven Learning, and Symbol Counting), achieving substantial improvements over both specialized lightweight models and zero-shot commercial VLMs on the CROHME and HME100K benchmarks.
Unified Reinforcement and Imitation Learning for Vision-Language Models: This paper proposes RIL (Unified Reinforcement and Imitation Learning), a training framework that combines GRPO-based reinforcement learning with GAIL-style adversarial imitation learning to substantially improve the performance of small VLMs (7B) by learning the text generation style of large VLMs (72B), without incurring additional inference latency or requiring an explicit "thinking" process.
Unifying Vision-Language Latents for Zero-Label Image Caption Enhancement: This paper proposes the ViZer framework, which improves the image captioning capability of VLMs through a unified vision-language latent space alignment training paradigm—requiring no text annotations whatsoever. Using only raw image data, the model learns to generate more grounded and descriptive captions.
UniTok: A Unified Tokenizer for Visual Generation and Understanding: This paper proposes UniTok, a unified tokenizer for visual generation and understanding that overcomes the representation capacity bottleneck of discrete tokens via Multi-Codebook Quantization (MCQ). On ImageNet, UniTok simultaneously achieves a state-of-the-art 0.38 rFID and 78.6% zero-shot accuracy, and it can be seamlessly integrated into MLLMs to enable both generation and understanding.
Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards: This paper proposes the Chain-of-Step (CoS) reasoning framework, which decomposes VLM reasoning chains into structured steps consisting of Name, Thought, and Reflection components. A step-level Process Reward Model (PRM) is trained to provide fine-grained reward signals. Combined with iterative DPO and step-level beam search, the framework systematically improves VLM reasoning—achieving an average of 73.4% (+4.0%) across 6 benchmarks on InternVL-2.5-MPO-8B and 64.2% (+12.1%) on LLaVA-NeXT-8B—while revealing the counterintuitive finding that quality matters far more than length in VLM reasoning, contrary to trends observed in LLM research.
VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents: VAGEN is a framework that structures the reasoning process of VLM agents into StateEstimation and TransitionModeling to build an internal world model, and combines WorldModeling Reward with Bi-Level GAE for efficient multi-turn RL training. A 3B model trained under this framework (0.82) surpasses GPT-5 (0.75) and Gemini 2.5 Pro (0.67).
VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models: This paper proposes VaMP, a variational multi-modal prompt learning framework that models text-side prompts as latent variables and performs instance-level uncertainty modeling via variational inference. Combined with a class-aware prior for regularizing the latent space, VaMP significantly improves CLIP's downstream adaptation under few-shot and domain generalization settings.
Video-R1: Reinforcing Video Reasoning in MLLMs: Inspired by DeepSeek-R1, this paper presents the first systematic exploration of applying the R1 paradigm (rule-based RL) to video reasoning. It proposes the T-GRPO algorithm to explicitly encourage temporal reasoning, constructs a mixed image-video training dataset, and achieves 37.1% accuracy on VSI-Bench, surpassing GPT-4o.
Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs: This paper presents Video-SafetyBench, the first comprehensive benchmark for safety evaluation of video LVLMs. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, constructed via a controllable video generation pipeline. A confidence-based evaluation metric, RJScore, is proposed to assess model outputs. Large-scale evaluation across 24 LVLMs reveals an average attack success rate of 67.2% under benign queries.
VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning: This paper proposes VideoRFT, which extends the reinforced fine-tuning (RFT) paradigm to video reasoning via a cognition-inspired multi-expert CoT data construction pipeline and a novel semantic consistency reward. Two datasets are constructed: VideoRFT-CoT-102K (for SFT) and VideoRFT-RL-310K (for RL), achieving state-of-the-art performance on 6 video reasoning benchmarks.
VIPAMIN: Visual Prompt Initialization via Embedding Selection and Subspace Expansion: This paper proposes VIPAMIN—a zero-extra-parameter visual prompt initialization strategy comprising two modules: attention-guided semantic Matching and orthogonal subspace injection (Orthogonalizing). It addresses two failure modes of self-supervised VPT—prompt attention uniformization and subspace collapse—requiring only a single forward pass, and achieves state-of-the-art performance across 24 visual tasks.
Vision Function Layer in Multimodal LLMs: This paper identifies that vision-related functional decoding in MLLMs is concentrated in specific narrow layer blocks (Vision Function Layers), exhibiting a consistent hierarchical order across model families (recognition → counting → grounding → OCR). Building on this finding, the authors propose VFL-LoRA (matching full-LoRA performance with only 1/3 of the parameters) and VFL-select (achieving 98% of full-data performance with 20% of the data).
ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding: To address the difficulty of draft models in handling redundant visual tokens during VLM speculative decoding, this paper proposes ViSpec, a framework that achieves significant acceleration (up to 3.22×) in VLM speculative decoding for the first time, via a visual adapter for image token compression, global visual feature injection, and synthetic training data generation.
Visual Instruction Bottleneck Tuning: This paper is the first to apply the Information Bottleneck (IB) principle to end-to-end instruction tuning of multimodal large language models. It proposes Visual Instruction Bottleneck Tuning (Vittle), which inserts a lightweight bottleneck layer inside the LLM to learn minimally sufficient representations. Vittle consistently improves robustness across 30 distribution shift scenarios without sacrificing performance on standard benchmarks.
Visual Structures Help Visual Reasoning: Addressing the Binding Problem in LVLMs: This paper proposes VISER (Visual Input Structure for Enhanced Reasoning), which constructs spatial partitions by superimposing equidistant horizontal lines with numeric labels onto input images, combined with a "row-by-row scan" textual instruction. This approach converts the parallel visual processing of LVLMs into sequential region-by-region parsing. Without modifying the model, without training, and within a single query, VISER substantially mitigates the binding problem and improves performance on visual reasoning tasks including counting, visual search, scene description, and spatial relationship understanding.
VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching: This paper proposes VLA-Cache, a training-free inference acceleration method for VLA models that identifies static visual tokens across frames and caches their KV representations, excludes task-relevant tokens from reuse, and adaptively adjusts the reuse ratio per layer, achieving a 1.7× speedup with negligible loss in task success rate.
VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning: This paper proposes VT-FSL, a framework that leverages Cross-modal Iterative Prompting (CIP) to jointly exploit class names and support images, driving LLMs to generate accurate, visually grounded textual descriptions and to synthesize semantically consistent images in a zero-shot manner. Combined with Cross-modal Geometric Alignment (CGA), a kernelized volume-based contrastive objective for global nonlinear cross-modal alignment, VT-FSL achieves an average classification accuracy improvement of 4.2% across 10 few-shot learning benchmarks.
Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM: This paper proposes TriSense — a tri-modal (visual + audio + speech) large language model that adaptively modulates per-modality weights via a Query-Based Connector for robust video temporal understanding, supported by the TriSense-2M dataset containing 2 million annotated samples.
WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios: This paper introduces WearVQA, the first VQA benchmark specifically designed for wearable device (smart glasses) scenarios. It comprises 2,520 egocentric image–question–answer triplets, systematically covering 7 visual domains, 10 cognitive task types, and 6 categories of wearable-specific image quality degradation. An accompanying LLM-as-a-judge evaluation framework achieves 96% accuracy, and the benchmark reveals that current SOTA multimodal models attain only 24–52% accuracy in this setting.
What Can RL Bring to VLA Generalization? An Empirical Study: This paper systematically investigates the effect of RL fine-tuning on the generalization capabilities of Vision-Language-Action (VLA) models. The study finds that PPO is the most effective RL algorithm, significantly outperforming DPO and GRPO; RL yields substantially greater OOD generalization than SFT in semantic understanding and execution robustness, while achieving comparable visual robustness.
When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning: This paper introduces the concept of modality sabotage as a diagnostic failure mode, proposes a lightweight and model-agnostic evaluation layer that treats each modality as an independent agent, and exposes "contributors" versus "saboteurs" through simple fusion. Applied to multimodal sentiment recognition benchmarks, the framework reveals systematic differences in per-modality reliability.
When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations: This paper identifies a "semantic hallucination" problem in Large Multimodal Models (LMMs) for scene text recognition—where non-semantic text is misread as semantically plausible words. Analysis reveals that Transformer layers whose attention is more focused on text regions are less prone to hallucination. Based on this finding, the authors propose a training-free framework, ZoomText + Grounded Layer Correction, achieving approximately 4–5% improvement on TextHalu-Bench and approximately 4% on ST-VQA.
STRUCTURE: With Limited Data for Multimodal Alignment, Let the Structure Guide You: This paper proposes STRUCTURE regularization and a representation-similarity-based layer selection strategy that achieves high-quality cross-modal alignment between frozen unimodal foundation models using only tens of thousands of paired samples (less than 1% of conventional data requirements), yielding average improvements of 51.6% in zero-shot classification and 91.8% in retrieval across 24 benchmarks.
Zero-Shot Robustness of Vision Language Models Via Confidence-Aware Weighting: This paper proposes CAW (Confidence-Aware Weighting), an adversarial fine-tuning loss function for CLIP that focuses on hard adversarial examples via confidence-aware weighting, combined with feature alignment regularization to preserve pre-trained semantic knowledge. CAW achieves state-of-the-art zero-shot robustness under AutoAttack with lower memory overhead.

🏥 Medical Imaging¶

3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks: This paper introduces 3D-RAD — the first large-scale 3D medical VQA benchmark, comprising 170K CT-based question-answer pairs across six clinical task categories (including a novel multi-temporal diagnosis task), accompanied by a 136K training set. The benchmark reveals critical deficiencies of existing VLMs in 3D temporal reasoning.
A Novel Approach to Classification of ECG Arrhythmia Types with Latent ODEs: This work combines a path-minimized Latent ODE encoder with a gradient-boosted decision tree (GBDT) into a two-stage ECG arrhythmia classification pipeline. On the MIT-BIH dataset, the macro AUC-ROC degrades only marginally from 0.984 at 360 Hz to 0.976 at 45 Hz, demonstrating strong robustness to sampling frequency variation.
A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking: This paper proposes UniVF, the first unified video fusion framework based on multi-frame learning, optical flow feature warping, and temporal consistency loss, along with VF-Bench, the first video fusion benchmark covering four major fusion tasks (multi-exposure, multi-focus, infrared-visible, and medical), achieving state-of-the-art performance across all sub-tasks.
A Variational Manifold Embedding Framework for Nonlinear Dimensionality Reduction: This paper proposes a variational manifold embedding framework that formalizes dimensionality reduction as an optimization problem over smooth embedding maps (minimizing the KL divergence between a prior distribution and the pullback of the data distribution), theoretically unifying PCA and nonlinear dimensionality reduction methods, and leverages the calculus of variations (Euler-Lagrange equations) and Noether's theorem to derive interpretable constraints on optimal embeddings.
AANet: Virtual Screening under Structural Uncertainty via Alignment and Aggregation: To address the unavailability of holo protein structures in real-world drug discovery, this paper proposes AANet—a framework that aligns representations via tri-modal contrastive learning (ligand–holo pocket–detected cavity) and aggregates multiple candidate binding sites through cross-attention. AANet substantially outperforms SOTA methods in blind screening on apo/predicted protein structures (EF1% on DUD-E: 11.75 → 37.19).
Active Target Discovery under Uninformative Prior: The Power of Permanent and Transient Memory: This paper proposes EM-PTDM, a framework inspired by the dual-memory system in neuroscience. It leverages a pretrained diffusion model as "permanent memory" and incorporates a lightweight "transient memory" module based on Doob's h-transform to achieve efficient active target discovery without any domain-specific prior data, with theoretical guarantees of monotonic prior improvement.
Amortized Active Generation of Pareto Sets: This paper proposes the A-GPS framework, which learns a conditional generative model over the Pareto set to perform online discrete black-box multi-objective optimization. It employs a non-dominance class probability estimator (CPE) as an implicit substitute for explicit hypervolume computation in PHVI, and achieves amortized posterior preference conditioning via preference direction vectors (without retraining). The approach demonstrates superior sample efficiency on synthetic benchmarks and protein design tasks.
Atomic Diffusion Models for Small Molecule Structure Elucidation from NMR Spectra: This paper proposes ChefNMR, the first end-to-end framework based on 3D atomic diffusion models that directly predicts the molecular structure of unknown small molecules (especially complex natural products) from 1D NMR spectra and molecular formulae alone, achieving state-of-the-art performance on both synthetic and experimental datasets.
GraphFLA: Augmenting Biological Fitness Prediction Benchmarks with Landscape Features: GraphFLA is an efficient fitness landscape analysis framework that computes 20 biologically meaningful landscape features (ruggedness / epistasis / navigability / neutrality) across 5,300+ real-world landscapes (ProteinGym / RNAGym / CIS-BP), revealing that model performance is highly dependent on landscape topology—e.g., VenusREM outperforms ProSST on highly navigable landscapes but underperforms it on highly epistatic ones—while processing one million mutants in just 20 seconds (vs. 5 hours for MAGELLAN).
Autoencoding Random Forests: RFAE is the first principled encode-decode framework for random forests. It exploits the positive-definiteness and universality of the RF kernel to derive low-dimensional encodings via diffusion-map spectral decomposition, and decodes back to the original feature space through k-NN regression in leaf-node space. Across 20 tabular datasets, RFAE achieves an average reconstruction rank of 1.80, substantially outperforming TVAE (3.38) and AE (3.27), and is successfully applied to MNIST reconstruction and scRNA-seq batch-effect removal.
BarcodeMamba+: Advancing State-Space Models for Fungal Biodiversity Research: BarcodeMamba+ is an SSM-based foundation model for fungal ITS DNA barcode classification. By adopting a pretrain-then-finetune paradigm to leverage large-scale unlabeled sequences, and incorporating three enhancements—hierarchical label smoothing, inverse square-root weighted loss, and multi-head outputs—it substantially outperforms BLAST, CNN, and Transformer baselines across all taxonomic ranks on three test sets, achieving a top species-level accuracy of 88.9%.
CrossNovo: Bidirectional Representations Augmented Autoregressive Biological Sequence Generation: CrossNovo integrates autoregressive (AR) and non-autoregressive (NAR) decoders through a shared spectrum encoder, importance annealing, and gradient-blocked knowledge distillation, enabling the bidirectional global understanding of NAR to augment AR sequence generation. On the 9-Species benchmark, it achieves amino acid accuracy of 0.811 (+2.6%) and peptide recall of 0.654 (+5.3%).
Brain Harmony: A Multimodal Foundation Model Unifying Morphology and Function into 1D Tokens: Brain Harmony is the first multimodal brain foundation model that unifies structural morphology (T1 sMRI) and functional dynamics (fMRI). It compresses high-dimensional neuroimaging data into compact 1D token representations via Geometric Harmonics Pre-alignment and Temporally Adaptive Patch Embedding (TAPE). The model consistently outperforms prior methods on neurodevelopmental/neurodegenerative disease diagnosis and cognitive prediction tasks.
Bridging Graph and State-Space Modeling for Intensive Care Unit Length of Stay Prediction: This paper proposes S2G-Net, a dual-branch architecture that integrates Mamba state-space temporal encoding with a multi-view graph neural network (GraphGPS) for ICU length-of-stay (LOS) prediction, achieving comprehensive improvements over sequential, graph-based, and hybrid baselines on MIMIC-IV.
Care-PD: A Multi-Site Anonymized Clinical Dataset for Parkinson's Disease Gait Assessment: This work introduces Care-PD — the largest multi-site anonymized 3D mesh dataset for Parkinson's disease (PD) gait analysis to date, comprising 9 cohorts, 8 clinical centers, 362 subjects, and 8,477 walking bouts. It provides a systematic benchmark for UPDRS gait scoring and motion pre-training tasks, demonstrating that fine-tuning on Care-PD reduces MPJPE from 60.8 mm to 7.5 mm and improves F1 by 17 percentage points.
CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research: This paper introduces CGBench, a clinical genetics benchmark grounded in ClinGen expert annotations, designed to evaluate the scientific literature reasoning capabilities of LLMs from both variant and gene curation perspectives. The benchmark encompasses three tasks—evidence scoring, evidence verification, and experimental evidence extraction—and finds that reasoning models perform best on fine-grained tasks but underperform non-reasoning models on high-level judgments.
CodeCrash: Exposing LLM Fragility to Misleading Natural Language in Code Reasoning: This paper proposes CodeCrash, a stress-testing framework that systematically evaluates the code reasoning robustness of 17 LLMs through functionally equivalent structural perturbations and misleading natural language injections (comments, print statements, and hints). The framework reveals an average performance drop of 23.2% across models, which CoT reasoning reduces only to 13.8%, and is the first to identify the "Reasoning Collapse" phenomenon in large reasoning models (LRMs).
Compressing Biology: Evaluating the Stable Diffusion VAE for Phenotypic Drug Discovery: This work presents the first systematic evaluation of the Stable Diffusion VAE (SD-VAE) for reconstructing Cell Painting fluorescence microscopy images. Results show that SD-VAE preserves phenotypic information well at both the pixel level and the biological signal level (with negligible drop in Fraction Retrieved), and that the general-purpose feature extractor InceptionV3 matches or outperforms the domain-specific model OpenPhenom on retrieval tasks.
ConfRover: Simultaneous Modeling of Protein Conformation and Dynamics via Autoregression: ConfRover is an autoregressive framework that factorizes protein MD trajectories into frame-wise conditional generation, $p(\mathbf{x}^{1:L}) = \prod_{l=1}^{L} p(\mathbf{x}^{l} \mid \mathbf{x}^{<l})$. Through a modular architecture consisting of an encoder, a causal Transformer, and an SE(3) diffusion decoder, it unifies three tasks (trajectory simulation, time-independent conformational sampling, and conformational interpolation) within a single model for the first time, achieving comprehensive improvements over MDGen on the ATLAS benchmark.
Consistent Sampling and Simulation: Molecular Dynamics with Energy-Based Diffusion Models: This paper identifies an inconsistency between sampling and simulation in diffusion models (particularly at small diffusion timesteps), proposes a Fokker-Planck-based regularization term to enforce consistency, and combines it with a time-partitioned Mixture-of-Experts (MoE) strategy to achieve consistent and efficient sampling and molecular dynamics simulation across multiple biomolecular systems.
Convolutional Monge Mapping between EEG Datasets to Support Independent Component Labeling: This paper extends CMMN (Convolutional Monge Mapping Normalization) with two strategies, channel-averaged PSD with an $\ell_1$-normalized barycenter and subject-to-subject matching, to generate a single time-domain filter for domain adaptation across EEG datasets with differing channel counts. On independent component (IC) brain/non-brain classification, the F1 score improves from 0.77 to 0.84, a larger gain than that of ICLabel (0.88→0.91).
CureAgent: A Training-Free Executor-Analyst Framework for Clinical Reasoning: CureAgent proposes an Executor-Analyst collaborative framework that decouples precise tool invocation (TxAgent/Llama-8B as Executor) from high-level clinical reasoning (Gemini 2.5 as Analyst). Combined with a Stratified Ensemble Late Fusion topology that preserves evidence diversity, the system achieves 83.8% accuracy on CURE-Bench without end-to-end fine-tuning, and reveals two critical scaling findings: the context–performance paradox and the curse of dimensionality in action space.
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays: This paper proposes CheXStruct and CXReasonBench — a structured diagnostic reasoning evaluation framework for chest X-rays that employs multi-path, multi-stage assessment to reveal critical deficiencies in existing LVLMs at intermediate reasoning steps.
DCA: Graph-Guided Deep Embedding Clustering for Brain Atlases: DCA (Deep Cluster Atlas) is a graph-guided deep embedding clustering framework that combines voxel-level spatiotemporal embeddings from a pretrained Swin-UNETR with KNN graph spatial regularization. By aligning soft assignments with atlas clustering auxiliary labels via KL divergence, the framework generates functionally homogeneous and spatially contiguous individualized brain atlases. On the HCP dataset, DCA achieves a 98.8% improvement in homogeneity and a 29% improvement in silhouette coefficient, and outperforms existing atlases on downstream tasks including autism diagnosis and cognitive decoding.
De novo generation of functional terpene synthases using TpsGPT: TpsGPT fine-tunes a distilled ProtGPT2 Tiny (38.9M parameters) on 79K terpene synthase (TPS) sequences to generate 28K candidate sequences, which are subsequently filtered through a multi-stage pipeline (perplexity / pLDDT / EnzymeExplorer / CLEAN / InterPro / Foldseek) to yield 7 de novo TPS sequences that are evolutionarily distant (<60% sequence identity) yet structurally conserved. Wet-lab experiments confirm that 2 of the 7 candidates possess TPS enzymatic activity—achieving functional enzyme de novo design at a GPU cost below $200.
Demo: Generative AI helps Radiotherapy Planning with User Preference: This paper proposes the Flexible Dose Proposer (FDP), a two-stage training framework (VQ-VAE pretraining + multi-condition encoding) that enables slider-based interactive 3D dose distribution prediction incorporating user preferences. The system is integrated into the Eclipse clinical treatment planning system and outperforms Varian RapidPlan in head-and-neck cancer radiotherapy scenarios.
Demo: Guide-RAG: Evidence-Driven Corpus Curation for Retrieval-Augmented Generation in Long COVID: This paper systematically evaluates six RAG corpus configurations for Long COVID clinical QA. The GS-4 configuration—combining clinical guidelines with high-quality systematic reviews—consistently outperforms both single-guideline and large-scale literature retrieval baselines across faithfulness, relevance, and comprehensiveness. The authors further introduce the Guide-RAG framework and the LongCOVID-CQ evaluation dataset.
DermaCon-IN: A Multi-concept Annotated Dermatological Image Dataset of Indian Skin Disorders: This work introduces DermaCon-IN—the first densely annotated dermatological image dataset predominantly featuring Indian skin tones (5,450 images / 3,002 patients / 245 diagnoses)—providing three-level hierarchical diagnostic labels, 47 lesion descriptors, and 49 anatomical site annotations, with benchmark evaluations using CNN, ViT, and concept bottleneck model architectures.
DesignX: Human-Competitive Algorithm Designer for Black-Box Optimization: This paper proposes DesignX, the first automated algorithm design framework that jointly learns two sub-tasks—optimizer workflow generation and dynamic hyperparameter control—through dual Transformer agents pre-trained at scale on 10k synthetic problems. DesignX surpasses human-designed optimizers on both synthetic benchmarks and real-world tasks including protein docking, AutoML, and UAV path planning.
DIsoN: Decentralized Isolation Networks for Out-of-Distribution Detection in Medical Imaging: This paper proposes Decentralized Isolation Networks (DIsoN), which detects OOD samples by training a binary classifier to "isolate" a test sample from training data, and leverages training data information without sharing it through decentralized parameter exchange. The method achieves state-of-the-art performance across 12 OOD detection tasks on 4 medical imaging datasets.
Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum: A fully self-supervised noise-robust representation learning framework is proposed, leveraging a "denoised→noisy" data curriculum strategy combined with denoised-teacher regularization. This enables SSL models such as DINOv2 to directly process noisy inputs at inference time without any denoiser, achieving a 4.8% improvement in linear probing accuracy under extreme Gaussian noise on ImageNet-1k.
Doctor Approved: Generating Medically Accurate Skin Disease Images through AI-Expert Feedback: This paper proposes MAGIC, a framework that encodes dermatologist-defined clinical checklists into structured evaluation prompts executable by MLLMs (e.g., GPT-4o), and uses the resulting feedback to fine-tune diffusion models via DPO or reward-based fine-tuning (RFT), generating clinically accurate skin disease images for data augmentation. MAGIC achieves +9.02% improvement on a 20-class skin disease classification task and +13.89% in few-shot settings.
Domain-Adaptive Transformer for Data-Efficient Glioma Segmentation in Sub-Saharan MRI: This paper proposes SegFormer3D+, a domain-adaptive Transformer architecture tailored for heterogeneous MRI data from Sub-Saharan Africa. By integrating histogram matching, radiomics-guided stratified sampling, a frequency-aware dual-path encoder, and a dual attention mechanism, the model achieves a mean Dice of 0.81 for glioma segmentation with only 60 annotated cases for fine-tuning, outperforming nnU-Net by +2.5%.
Dual Mixture-of-Experts Framework for Discrete-Time Survival Analysis: This paper proposes a Dual Mixture-of-Experts (Dual MoE) framework for discrete-time survival analysis, combining a feature encoder MoE (for modeling patient subgroup heterogeneity) with a hazard network MoE (for capturing temporal dynamics). The framework achieves improvements of up to 0.04 in time-dependent C-index on the METABRIC and GBSG breast cancer datasets.
DyG-Mamba: Continuous State Space Modeling on Dynamic Graphs: DyG-Mamba introduces continuous state space models (SSMs) into dynamic graph learning. It proposes a temporal span-aware continuous SSM that models irregular time intervals via an exponential decay function inspired by the Ebbinghaus forgetting curve, combined with input-dependent parameters constrained by spectral norm for Lipschitz robustness. The method achieves an average rank of 2.42 across 12 dynamic graph benchmarks (vs. DyGFormer's 2.92) while maintaining $O(bdL)$ linear complexity.
Dynamic Causal Discovery in Alzheimer's Disease through Latent Pseudotime Modelling: This paper applies BN-LTE (Bayesian Network with Latent Time Embedding) to real-world ADNI data from AD patients to infer dynamic causal graphs that evolve along a disease pseudotime axis. The learned pseudotime achieves a diagnostic AUC of 0.82, substantially outperforming chronological age (AUC 0.59), and reveals dynamic causal relationships between emerging biomarkers NfL/GFAP and established AD markers.
EDBench: Large-Scale Electron Density Data for Molecular Modeling: This work constructs EDBench, the largest electron density (ED) dataset to date (3.3 million molecules, computed via B3LYP/6-31G** DFT), and designs a three-category benchmark evaluation framework covering prediction, retrieval, and generation tasks. It provides the first systematic assessment of deep learning models' ability to understand and exploit electron density.
EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis: This paper introduces EndoBench, the first comprehensive MLLM evaluation benchmark covering 4 endoscopic scenarios, 12 clinical tasks, and 5 levels of visual prompt granularity, comprising 6,832 clinically validated VQA pairs. Evaluation of 23 MLLMs reveals that commercial models generally outperform open-source and medical-specific counterparts, yet all remain below human expert performance.
Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling: This paper proposes Energy Matching, which unifies flow matching and energy-based models via a single time-independent scalar potential field: far from the data manifold, the model performs efficient transport along optimal transport paths; near the manifold, it transitions to a Boltzmann equilibrium distribution for likelihood modeling. The method achieves an FID of 3.34 on CIFAR-10, improving on existing EBMs by more than 50%.
EWC-Guided Diffusion Replay for Exemplar-Free Continual Learning in Medical Imaging: This paper proposes an exemplar-free continual learning framework that combines class-conditional DDPM diffusion replay with Elastic Weight Consolidation (EWC). It achieves an AUROC of 0.851 on MedMNIST v2 (8 tasks across 2D/3D) and CheXpert, reducing forgetting by over 30% compared to DER++ and approaching the joint training upper bound (0.869), while requiring no storage of original patient data.
FAPEX: Fractional Amplitude-Phase Expressor for Robust Cross-Subject Seizure Prediction: This paper proposes FAPEX, a framework that achieves adaptive time-frequency decomposition via a learnable Fractional Neural Frame Operator (FrNFO), combined with Amplitude-Phase Cross-Encoding (APCE) and Spatial Correlation Aggregation (SCA). FAPEX comprehensively outperforms 33 baseline methods across 12 cross-species, cross-modality seizure prediction benchmarks.
Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling: This paper proposes HiVE-MIL, a hierarchical vision-language MIL framework that constructs a unified heterogeneous graph to model cross-scale hierarchical relationships (5× and 20×) and intra-scale multimodal alignment. Combined with a text-guided dynamic filtering mechanism and a hierarchical contrastive loss, HiVE-MIL consistently outperforms existing methods under the 16-shot setting on three TCGA datasets (lung, breast, and renal cancer), achieving up to 4.1% improvement in Macro F1.
FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models: This paper presents FGBench, a dataset comprising 625K molecular property reasoning questions focused on functional group-level reasoning evaluation. Through three dimensions (single functional group effect, multi-functional group interaction, and molecular comparison), it systematically reveals the severe deficiencies of current LLMs in fine-grained chemical reasoning.
FireGNN: Neuro-Symbolic Graph Neural Networks with Trainable Fuzzy Rules for Interpretable Medical Image Classification: This paper proposes FireGNN, which for the first time embeds trainable fuzzy rules into the GNN forward pass. Using three topological descriptors—node degree, clustering coefficient, and label consistency—FireGNN achieves endogenous interpretability for medical image classification, outperforming standard GCN/GAT/GIN and auxiliary-task baselines on 5 MedMNIST datasets and MorphoMNIST.
Flow Density Control: Generative Optimization Beyond Entropy-Regularized Fine-Tuning: This paper proposes Flow Density Control (FDC), which generalizes the fine-tuning of pretrained flow/diffusion models from KL-regularized expected reward maximization to a unified framework supporting arbitrary distributional utility functions with arbitrary divergence regularization. The approach decomposes nonlinear objectives into a sequence of linear fine-tuning subproblems and provides convergence guarantees.
FOXES: A Framework For Operational X-ray Emission Synthesis: This paper proposes FOXES, a Vision Transformer-based framework that translates multi-channel solar EUV observation images into soft X-ray (SXR) flux, achieving an overall Pearson correlation of 0.982. The framework lays the groundwork for far-side solar flare detection and the construction of more complete flare catalogs.
Fractional Diffusion Bridge Models: This paper proposes Fractional Diffusion Bridge Models (FDBM), which incorporate fractional Brownian motion (fBM) into the generative diffusion bridge framework. The Hurst exponent $H$ controls the roughness and long-range dependence of trajectories, yielding improvements over Brownian motion baselines on protein conformation prediction and image translation tasks.
From Black Box to Biomarker: Sparse Autoencoders for Interpreting Speech Models of Parkinson's Disease: This work adapts sparse autoencoder (SAE) techniques from large language model interpretability research to speech-based Parkinson's disease (PD) detection, proposes a Mask-based SAE to address small-dataset limitations, discovers that model predictions rely primarily on spectral flux and spectral flatness in low-energy regions, and further reveals that these features correlate significantly with MRI putamen volume—establishing a bridge from internal model representations to clinical biomarkers.
Generalizable, Real-Time Neural Decoding with Hybrid State-Space Models: POSSM proposes a hybrid SSM-attention architecture that combines spike-level tokenization with a recurrent state-space model backbone, achieving generalizable real-time neural decoding with inference speeds up to 9× faster than Transformers while maintaining comparable accuracy.
Generating Multi-Table Time Series EHR from Latent Space with Minimal Preprocessing: This paper proposes RawMed, the first framework to synthesize multi-table time series EHR data from raw records with minimal lossy preprocessing. Events are textualized, compressed into a discrete latent space via Residual Quantization, and their temporal dynamics are then modeled with an autoregressive Transformer. RawMed outperforms existing baselines in fidelity, clinical utility, and privacy protection.
Generative Distribution Embeddings: Lifting Autoencoders to the Space of Distributions for Multiscale Representation Learning: This paper proposes Generative Distribution Embeddings (GDE), which lifts autoencoders to the space of distributions — the encoder operates on sets of samples while the decoder is replaced by a conditional generative model — thereby learning distribution-level representations. The framework is validated on 6 computational biology tasks.
Generative Modeling of Full-Atom Protein Conformations using Latent Diffusion on Graph Embeddings: This work proposes LD-FPG, a framework that encodes full-atom MD trajectories into a low-dimensional latent space via Chebyshev graph neural networks and applies DDPM in that space to generate novel conformational ensembles. To the authors' knowledge, this is the first approach to generate protein conformations that includes all heavy atoms of the side chains.
GeoDynamics: A Geometric State-Space Neural Network for Understanding Brain Dynamics on Riemannian Manifolds: This paper proposes GeoDynamics, which generalizes the classical state-space model (SSM) from Euclidean space to the symmetric positive definite (SPD) manifold. By employing weighted Fréchet mean aggregation and orthogonal group translations, it achieves geometrically consistent state evolution on the manifold, attaining state-of-the-art performance on brain connectome analysis (early diagnosis of AD/PD/ASD) and human action recognition.
GFlowNets for Learning Better Drug-Drug Interaction Representations: To address the severe class imbalance in drug-drug interaction (DDI) prediction, this paper proposes combining GFlowNet with a variational graph autoencoder (VGAE). By reward-guided generative sampling, the framework synthesizes training samples for rare interaction types, thereby enhancing predictive performance on infrequent yet clinically critical interaction categories.
H-DDx: A Hierarchical Evaluation Framework for Differential Diagnosis: H-DDx proposes a differential diagnosis evaluation framework grounded in the ICD-10 classification hierarchy. By expanding both predicted and ground-truth diagnoses to their ancestor nodes and computing a Hierarchical Diagnostic F1 (HDF1), the framework rewards "clinically relevant approximate correctness" rather than exact match only. Evaluating 22 LLMs reveals that the domain-specialized model MediPhi rises from 20th to 2nd place under HDF1, an advantage completely obscured by Top-5 metrics.
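As a rough illustration of the ancestor-expansion idea behind HDF1 in the H-DDx entry above, the sketch below computes a set-based F1 after adding every ancestor of each predicted and ground-truth code. The toy hierarchy, the node names, and the function names are hypothetical; they are not taken from ICD-10 or from the H-DDx implementation, and the exact HDF1 definition in the paper may differ.

```python
# Toy child -> parent map (hypothetical labels, NOT real ICD-10 codes).
PARENT = {
    "bacterial_pneumonia": "pneumonia",
    "viral_pneumonia": "pneumonia",
    "pneumonia": "respiratory",
    "asthma": "respiratory",
}


def expand_with_ancestors(codes, parent):
    """Return the input codes together with all of their ancestors."""
    expanded = set()
    for code in codes:
        expanded.add(code)
        # Walk up the hierarchy until a root (a node with no parent) is reached.
        while code in parent:
            code = parent[code]
            expanded.add(code)
    return expanded


def hierarchical_f1(predicted, gold, parent):
    """Set-based F1 computed on ancestor-expanded prediction and gold sets."""
    pred_set = expand_with_ancestors(predicted, parent)
    gold_set = expand_with_ancestors(gold, parent)
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# A near miss: the exact codes differ, but they share two ancestors.
near_miss = hierarchical_f1({"bacterial_pneumonia"}, {"viral_pneumonia"}, PARENT)
exact_hit = hierarchical_f1({"asthma"}, {"asthma"}, PARENT)
print(round(near_miss, 3), exact_hit)
```

In this toy case an exact-match metric would score the near miss as 0, while the expanded sets share two of three nodes on each side and the score is 2/3, which is the kind of "approximately correct" credit the entry describes.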
ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression: This work re-evaluates the feature reliance of CNNs using a systematic feature suppression framework instead of cue-conflict experiments. It finds that CNNs are not inherently texture-biased but rely primarily on local shape features, and that feature reliance patterns differ substantially across domains (computer vision, medical imaging, and remote sensing).
Interpreting GFlowNets for Drug Discovery: Extracting Actionable Insights for Medicinal Chemistry: This work constructs a multi-level interpretability toolkit for SynFlowNet (a GFlowNet grounded in synthetic reaction templates), integrating gradient saliency, counterfactual perturbation, sparse autoencoders (SAE), and motif probes to reveal how internal representations encode physicochemical properties and functional group information relevant to medicinal chemistry.
Is Sequence Information All You Need for Bayesian Optimization of Antibodies?: This paper systematically compares the roles of sequence and structural information in antibody Bayesian optimization, finding that sequence-only methods augmented with protein language model (pLM) soft constraints can match the performance of structure-based methods, thereby questioning the necessity of structural information in antibody Bayesian optimization.
Iterative Foundation Model Fine-Tuning on Multiple Rewards: This paper proposes IterativeRS (Iterative Rewarded Soups), which alternates between independent fine-tuning of per-objective expert policies and policy merging. The method unifies reward combination and expert merging approaches, outperforming MORLHF and Rewarded Soups on small molecule design, DNA sequence generation, and text summarization tasks.
JAMUN: Bridging Smoothed Molecular Dynamics and Score-Based Learning for Conformational Ensembles: This paper proposes JAMUN, a conformational ensemble generation method built on the Walk-Jump Sampling (WJS) framework. By performing Langevin dynamics on a noise-smoothed manifold and using an SE(3)-equivariant denoiser to jump back to the original distribution, JAMUN achieves peptide conformational sampling an order of magnitude faster than conventional molecular dynamics while retaining transferability to out-of-training systems.
JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model: JanusDNA is proposed as the first bidirectional DNA foundation model, combining a Mamba-Attention-MoE hybrid architecture with the Janus Modeling pretraining paradigm to achieve bidirectional understanding at the training efficiency of autoregressive methods, attaining state-of-the-art performance across multiple genomic benchmarks.
Large Language Models as Medical Codes Selectors: A Benchmark Using the International Classification of Primary Care: This work constructs a medical coding benchmark based on an extract-retrieve-select framework, evaluating ICPC-2 code selection capability across 33 LLMs. Results show that 28 models achieve F1 > 0.8, demonstrating that LLMs can effectively automate primary care coding without fine-tuning.
Learning Conformational Ensembles of Proteins Based on Backbone Geometry: This paper proposes BBFlow, a flow matching generative model based on protein backbone geometry for conformational ensemble sampling. BBFlow requires neither evolutionary sequence information nor pretrained folding models, achieves inference speeds more than an order of magnitude faster than AlphaFlow, and generalizes to multi-chain proteins.
Learning Relative Gene Expression Trends from Pathology Images in Spatial Transcriptomics: This paper proposes STRank, a loss function that reformulates gene expression estimation from pathology images as a ranking score estimation task. By modeling the stochastic noise inherent in expression counts via binomial/multinomial distributions, STRank enables models to learn robust relative expression relationships from spatial transcriptomics data subject to batch effects and random fluctuations.
LLM-Assisted Emergency Triage Benchmark: Bridging Hospital-Rich and MCI-Like Field Simulation: This work constructs an open, LLM-assisted emergency triage benchmark based on MIMIC-IV-ED, defining two evaluation scenarios—hospital-rich and mass casualty incident (MCI)-like field simulation—and providing baseline models along with SHAP-based interpretability analysis to promote reproducibility and accessibility in triage prediction research.
LoMix: Learnable Weighted Multi-Scale Logits Mixing for Medical Image Segmentation: LoMix introduces a Combinatorial Mutation Module (CMM) that generates "mutant" logits from multi-scale outputs via four fusion operators (addition / multiplication / concatenation / attention-weighted fusion) across all subset combinations, paired with NAS-style Softplus learnable weights for automatic contribution balancing. On Synapse 8-organ segmentation, Dice improves from 80.9% to 85.1% (+4.2%), and the gain reaches +9.23% when only 5% of the training data is used.
Magical: Medical Lay Language Generation via Semantic Invariance and Layperson-tailored Adaptation: This paper proposes Magical, an asymmetric LoRA architecture for medical lay language generation (MLLG) that enforces a semantic invariance constraint on the shared matrix $A$ while employing multiple independent matrices $B$ to enable semantically faithful and stylistically diverse lay language generation. Magical reduces trainable parameters by 31.66% while outperforming all LoRA variants.
Mamba Goes HoME: Hierarchical Soft Mixture-of-Experts for 3D Medical Image Segmentation: This paper proposes Mamba-HoME, an architecture that integrates a Hierarchical Soft Mixture-of-Experts (HoME) with the Mamba SSM. Through a two-level token routing mechanism, it achieves local-to-global feature modeling and surpasses existing state-of-the-art methods on 3D medical image segmentation across CT, MRI, and ultrasound modalities, while maintaining linear computational complexity.
Manipulating 3D Molecules in a Fixed-Dimensional E(3)-Equivariant Latent Space: This paper proposes MolFLAE, a 3D molecular variational autoencoder that learns a fixed-dimensional, E(3)-equivariant latent space. By introducing learnable virtual nodes and a Bayesian Flow Network (BFN) decoder, MolFLAE enables zero-shot molecular editing — including atom-count editing, structural reconstruction, and property interpolation — and demonstrates practical utility in drug optimization targeting the human glucocorticoid receptor (hGR).
MATCH: Multi-faceted Adaptive Topo-Consistency for Semi-Supervised Histopathology Segmentation: This paper proposes the MATCH framework, which tightly couples topological reasoning with the perturbation-robustness principle of semi-supervised learning. By exploiting dual-level topological consistency across random perturbations and temporal training snapshots, MATCH adaptively identifies reliable topological structures without requiring manually defined thresholds, substantially reducing topological errors in histopathology image segmentation.
MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks: This paper proposes MedAgentBoard, a comprehensive benchmark that systematically evaluates multi-agent collaboration, single-LLM, and conventional methods across diverse medical tasks, revealing that multi-agent collaboration does not consistently outperform strong single models or specialized conventional approaches.
MedMKG: Benchmarking Medical Knowledge Exploitation with Multimodal Knowledge Graph: This paper constructs MedMKG, a medical multimodal knowledge graph that integrates MIMIC-CXR imaging data with UMLS clinical concepts, proposes a Neighbor-aware Filtering (NaF) algorithm for image selection, and conducts comprehensive benchmarking of 24 baseline methods across three tasks: link prediction, text-image retrieval, and VQA.
Mind the (Data) Gap: Evaluating Vision Systems in Small Data Applications: This paper systematically compares MLLMs (e.g., Gemini, Qwen2.5-VL) and vision encoder + SVM pipelines on the NeWT ecological classification benchmark across the "small data regime" (10–1000 labeled samples). MLLMs plateau after 10–30 samples, whereas vision-based methods exhibit near-logarithmic growth throughout, calling on the community to prioritize small-data evaluation.
Mind the Gap: Aligning Knowledge Bases with User Needs to Enhance Mental Health Retrieval: This paper proposes a knowledge base augmentation framework grounded in "demand gap" analysis. By overlaying real user data (forum posts) onto existing mental health resource repositories to identify content voids, the framework applies targeted augmentation strategies to achieve near-full-corpus RAG retrieval quality with minimal document additions.
MIRA: Medical Time Series Foundation Model for Real-World Health Data: This paper presents MIRA, a foundation model specifically designed for irregular medical time series. Through continuous-time rotary position encoding (CT-RoPE), frequency-specific Mixture-of-Experts (MoE), and a Neural ODE-based extrapolation module, MIRA is pretrained on 454 billion observation points and reduces average zero-shot forecasting error by 8% and 6% in out-of-distribution (OOD) and in-distribution (ID) settings, respectively.
Modeling X-ray Photon Pile-up with a Normalizing Flow: This paper proposes a Simulation-Based Inference (SBI) framework based on Normalizing Flows. A CNN extracts spatially resolved X-ray spectral features, which are then passed to a neural spline flow to perform accurate posterior estimation of astrophysical source parameters in the presence of photon pile-up, substantially outperforming the conventional PSF-core excision approach.
Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Models: This paper proposes Mol-LLaMA, a large molecular language model for general molecular understanding. By designing three types of instruction data and a 2D-3D molecular representation fusion module, Mol-LLaMA surpasses GPT-4o in molecular feature understanding while exhibiting interpretability and reasoning capabilities.
MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology: This paper introduces MTBBench—the first clinical benchmark simultaneously covering three dimensions: multimodality, longitudinal temporal sequencing, and interactive agent workflows. It simulates the decision-making process of Molecular Tumor Boards (MTBs) to evaluate and enhance the multimodal longitudinal reasoning capabilities of AI agents in precision oncology.
Multi-Objective Reinforcement Learning with Max-Min Criterion: A Game-Theoretic Approach: This paper reformulates max-min multi-objective reinforcement learning as a two-player zero-sum regularized continuous game, proposes the ERAM/ARAM algorithms, and leverages mirror descent to achieve a concise closed-form weight update. The approach guarantees global last-iterate convergence and significantly outperforms existing methods on tasks such as traffic signal control.
Multimodal 3D Genome Pre-training: This paper proposes MIX-HIC — the first multimodal foundation model for 3D genomics — which integrates Hi-C contact maps and epigenomic signals via cross-modal interaction blocks and cross-modal mapping blocks. Pre-trained on over 1.27 million paired samples, MIX-HIC achieves state-of-the-art performance across three downstream tasks: Hi-C prediction, chromatin loop detection, and CAGE-seq expression prediction.
Multimodal Bayesian Network for Robust Assessment of Casualties in Autonomous Triage: This paper proposes an expert-knowledge-driven Bayesian network decision-support framework that fuses outputs from multiple computer vision models to assess casualty conditions. Requiring no training data and supporting inference under incomplete information, the framework improved triage accuracy from 14% to 53% and diagnostic coverage from 31% to 95% in the DARPA Triage Challenge.
Multimodal Disease Progression Modeling via Spatiotemporal Disentanglement and Multiscale Alignment: This paper proposes DiPro, a framework that addresses redundancy in longitudinal chest X-ray sequences and cross-modal temporal misalignment through region-aware spatiotemporal disentanglement (separating static anatomical from dynamic pathological features) and multiscale alignment (local–global fusion of CXR and EHR), achieving state-of-the-art performance on disease progression recognition and ICU prediction tasks.
Multiscale Guidance of Protein Structure Prediction with Heterogeneous Cryo-EM Data: CryoBoltz leverages cryo-EM density maps to guide the sampling trajectory of a pretrained diffusion-based structure prediction model (Boltz-1) via a multiscale guidance mechanism (global → local), generating multi-conformational atomic models consistent with experimental data without any retraining.
MuSLR: Multimodal Symbolic Logical Reasoning: This paper introduces MuSLR, the first multimodal symbolic logical reasoning task, along with its benchmark MuSLR-Bench (1,093 instances spanning 7 domains, 35 atomic symbolic logic rules, and reasoning depths of 2–9). It further proposes LogiCAM, a modular framework comprising premise selection, reasoning type identification, and symbolic reasoning modules, which improves GPT-4.1's CoT performance by 14.13%.
NeurIPT: Foundation Model for Neural Interfaces: NeurIPT is an EEG foundation model for diverse brain–computer interface (BCI) applications. Through four key innovations—Amplitude-Aware Masking Pre-training (AAMP), Progressive Mixture-of-Experts (PMoE) architecture, 3D electrode spatial encoding, and Intra- and Inter-Lobe Pooling (IILP)—it achieves state-of-the-art performance across eight downstream BCI tasks.
One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra: By employing MIST as a spectra-to-fingerprint encoder and MolForge as a fingerprint-to-structure decoder, combined with a prior-adjusted thresholding strategy, this work achieves a tenfold performance improvement on the MassSpecGym benchmark for de novo molecular structure generation from mass spectra (top-1 accuracy from 2.3% to 31%).
Online Feedback Efficient Active Target Discovery in Partially Observable Environments: This paper proposes DiffATD, which leverages the reverse process of diffusion models to construct a belief distribution for balancing exploration and exploitation, enabling efficient target region discovery in partially observable environments without any supervised training. The framework is applicable across multiple domains including medical imaging, species discovery, and remote sensing.
Ordinal Label-Distribution Learning with Constrained Asymmetric Priors for Imbalanced Retinal Grading: This paper proposes CAP-WAE (Constrained Asymmetric Prior Wasserstein Autoencoder), which addresses the challenges of long-tailed distribution and ordinal structure in diabetic retinopathy (DR) grading through three innovations: asymmetric priors, a margin-aware orthogonality and compactness loss, and a direction-aware ordinal loss, achieving state-of-the-art performance on multiple DR benchmarks.
Orochi: Versatile Biomedical Image Processor: This paper proposes Orochi—the first general-purpose foundation model for low-level biomedical image processing. Through Task-related Joint-embedding Pre-training (TJP) and a Multi-head Hierarchy Mamba architecture, Orochi matches or surpasses task-specific state-of-the-art models across four tasks—registration, fusion, restoration, and super-resolution—with lightweight fine-tuning of fewer than 5% of parameters.
Pancakes: Consistent Multi-Protocol Image Segmentation Across Biomedical Domains: This paper proposes the Pancakes framework, which, given a collection of biomedical images from an unseen domain, automatically generates label maps for multiple plausible segmentation protocols, ensuring semantic consistency across images within the same protocol—i.e., the same label refers to the same anatomical structure across all images.
PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions: This paper introduces PatientSim — an LLM-based patient simulator grounded in real MIMIC clinical data and a four-dimensional persona framework (personality, language proficiency, medical history recall, and cognitive confusion), generating 37 unique persona combinations. The system is evaluated across 8 LLMs for factual accuracy and persona fidelity, and validated by 4 clinical experts with a mean quality score of 3.89/4.
Pharmacophore-Guided Generative Design of Novel Drug-Like Molecules: This paper proposes a pharmacophore-guided molecular generation framework that simultaneously maximizes pharmacophore similarity and minimizes structural similarity within the reward function of a reinforcement learning model (FREED++), generating candidate drug molecules that retain bioactivity features while exhibiting high structural novelty.
PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation: This paper proposes PhysioWave, a multi-scale Transformer architecture based on learnable wavelet decomposition and frequency-guided masking. It establishes, for the first time, large-scale pretrained foundation models for EMG and ECG, and achieves state-of-the-art performance on both unimodal and multimodal physiological signal tasks through a multimodal fusion framework.
PolyPose: Deformable 2D/3D Registration via Polyrigid Transformations: This paper presents PolyPose, a deformable 2D/3D registration method based on polyrigid transformations. Leveraging the anatomical prior that bones are rigid bodies, PolyPose parameterizes complex 3D deformation fields as weighted combinations of multiple rigid transformations in the Lie algebra $\mathfrak{se}(3)$, enabling accurate 3D volumetric registration from as few as two X-ray images without any regularization or hyperparameter tuning.
Position: Thematic Analysis of Unstructured Clinical Transcripts with Large Language Models: This position paper systematically reviews the current state of LLM-assisted thematic analysis (TA) on unstructured clinical transcripts, identifies highly fragmented evaluation practices across the literature, and proposes a standardized evaluation framework centered on three dimensions: Validity, Reliability, and Interpretability.
Posterior Sampling by Combining Diffusion Models with Annealed Langevin Dynamics: This paper proposes an algorithm combining diffusion models with annealed Langevin dynamics that requires only $L^4$-accurate score estimates to achieve polynomial-time posterior sampling under (locally) log-concave distributions, providing the first theoretical guarantees for warm-started inverse problem solving.
Prior-Guided Flow Matching for Target-Aware Molecule Design with Learnable Atom Number: This paper proposes PAFlow, a 3D molecule generation model built on the flow matching framework, which guides the vector field via a protein–ligand interaction predictor and determines atom counts through a learnable atom number predictor. PAFlow achieves a new state-of-the-art Avg. Vina Score of −8.31 on CrossDocked2020, substantially outperforming existing methods.
PROSPERO: Active Learning for Robust Protein Design Beyond Wild-Type Neighborhood: This paper proposes ProSpero, an active learning framework that discovers high-fitness and novel protein sequences even under surrogate model mismatch, via inference-time sampling of a frozen pretrained generative model (EvoDiff) guided by a surrogate, a targeted masking strategy, and biologically-constrained SMC sampling.
Protein Design with Dynamic Protein Vocabulary: ProDVa introduces natural protein fragments as a "dynamic vocabulary" for generative protein design, employing a three-component architecture consisting of a text encoder, a protein language model, and a fragment encoder. Using less than 0.04% of the training data required by prior work, ProDVa designs functionally aligned and structurally foldable protein sequences, surpassing the SOTA model Pinal by 7.38% on the pLDDT>70 ratio.
QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training: QoQ-Med constructs a multimodal clinical foundation model spanning 9 clinical modalities (1D ECG + 6 types of 2D images + 2 types of 3D scans), and proposes Domain-aware Relative Policy Optimization (DRPO)—which employs hierarchical temperature scaling (inter-domain × intra-domain K-means clustering) to address modality/difficulty imbalance. Trained on 2.61 million instruction-tuning pairs, it achieves an average F1 of 0.295 (vs. GRPO 0.193, +52.8%), ranking best in 6 out of 8 modalities.
Quantifying the Role of OpenFold Components in Protein Structure Prediction: This paper proposes a systematic methodology for evaluating the contribution of individual Evoformer components in OpenFold/AlphaFold2 to protein structure prediction accuracy. The study finds that MSA column attention and MLP Transition layers are the most critical components, and that the importance of multiple components is significantly correlated with protein sequence length.
RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis: This paper proposes RAD, a retrieval-augmented diagnostic framework that retrieves disease guidelines from multi-source medical corpora and injects them throughout the full pipeline of a multimodal model — from feature extraction to cross-modal fusion. A dual-axis explainability evaluation protocol is also introduced. RAD achieves state-of-the-art performance on four datasets spanning distinct anatomical regions.
RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray: This paper proposes RadZero, a framework centered on VL-CABS (Vision-Language Cross-Attention Based on Similarity), enabling explainable and fine-grained vision-language alignment on chest X-rays with unified support for zero-shot classification, localization, and segmentation.
RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis: RAM-W600 is the first publicly available multi-task wrist conventional radiograph dataset, comprising 1,048 images and supporting two clinically relevant tasks: carpal bone instance segmentation and SvdH bone erosion (BE) scoring, accompanied by comprehensive benchmarking.
Random Search Neural Networks for Efficient and Expressive Graph Learning: This paper proposes Random Search Neural Networks (RSNN), which replace random walks with randomized depth-first search (DFS) for graph structure sampling. On sparse graphs, RSNN achieves complete edge coverage with only $O(\log|V|)$ searches. Paired with a universal sequence model, RSNN attains universal approximation capability, and consistently outperforms RWNN on molecular and protein benchmarks using up to 16× fewer samples.
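To make the sampling step in the RSNN entry above concrete, here is a minimal sketch of a randomized depth-first search that shuffles neighbor order at every node, plus a helper that counts how many searches are needed before every edge has appeared in some DFS tree. Treating the DFS tree edges as the edges "covered" by one search is an assumption made for illustration, as are all function names; the paper's notion of coverage and its sequence-model encoder are not reproduced here.

```python
import random


def randomized_dfs(adj, root, rng):
    """One DFS from `root` with neighbors visited in random order.

    Returns the visit order and the set of tree edges used by the search.
    """
    visited = {root}
    order = [root]
    tree_edges = set()
    # Each stack frame holds a node and an iterator over its shuffled neighbors.
    stack = [(root, iter(rng.sample(adj[root], len(adj[root]))))]
    while stack:
        node, neighbors = stack[-1]
        for nxt in neighbors:
            if nxt not in visited:
                visited.add(nxt)
                order.append(nxt)
                tree_edges.add(frozenset((node, nxt)))
                stack.append((nxt, iter(rng.sample(adj[nxt], len(adj[nxt])))))
                break
        else:
            # All neighbors of `node` are exhausted: backtrack.
            stack.pop()
    return order, tree_edges


def searches_until_covered(adj, rng, max_searches=100):
    """Number of random-root searches until every edge was a tree edge once."""
    all_edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    covered = set()
    nodes = sorted(adj)
    for k in range(1, max_searches + 1):
        _, tree_edges = randomized_dfs(adj, rng.choice(nodes), rng)
        covered |= tree_edges
        if covered == all_edges:
            return k
    return None  # not covered within the search budget


# Undirected toy graphs as adjacency lists.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
cycle_graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

print(searches_until_covered(path_graph, random.Random(0)))
print(searches_until_covered(cycle_graph, random.Random(0)))
```

On a tree such as the path graph, a single search already uses every edge. On the 4-cycle each search leaves out exactly one edge, so at least two searches with different outcomes are needed.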
RAxSS: Retrieval-Augmented Sparse Sampling for Explainable Variable-Length Medical Time Series Classification: This paper proposes RAxSS, a framework that integrates retrieval augmentation into the stochastic sparse sampling (SSS) pipeline. By replacing uniform averaging with intra-window similarity-weighted aggregation, RAxSS maintains competitive performance on variable-length medical time series classification while providing an interpretable evidence chain spanning from "where" to "why."
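The contrast between uniform averaging and similarity-weighted aggregation in the RAxSS entry above can be sketched as follows. The softmax weighting, the `temperature` parameter, and the example numbers are assumptions chosen for illustration; the entry does not specify how RAxSS turns similarities into weights.

```python
import math


def uniform_average(window_probs):
    """Baseline: every sampled window contributes equally."""
    return sum(window_probs) / len(window_probs)


def similarity_weighted_average(window_probs, similarities, temperature=1.0):
    """Aggregate per-window predictions with softmax weights over similarities.

    Returns the aggregated prediction and the weights, which can be read as
    a per-window attribution of where the final prediction came from.
    """
    scaled = [s / temperature for s in similarities]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    aggregated = sum(w * p for w, p in zip(weights, window_probs))
    return aggregated, weights


# Hypothetical per-window class probabilities and retrieval similarities.
probs = [0.9, 0.2, 0.4]
sims = [2.0, 0.1, 0.3]

baseline = uniform_average(probs)
weighted, weights = similarity_weighted_average(probs, sims)
print(baseline, weighted, weights)
```

When all similarities are equal the weighted result reduces to the uniform average, so the weighting only changes the prediction when some windows have stronger retrieved support than others.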
Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology: This paper revisits end-to-end (E2E) learning with slide-level supervision in computational pathology, and is the first to identify optimization difficulties induced by sparse-attention MIL under E2E training. It proposes ABMILX, which addresses this issue via multi-head attention and a global attention correction module, enabling E2E-trained ResNets to surpass state-of-the-art foundation models on multiple benchmarks.
Robust or Suggestible? Exploring Non-Clinical Induction in LLM Drug-Safety Decisions: Through a persona-based evaluation framework, this paper finds that ChatGPT-4o and Bio-Medical-Llama-3-8B are systematically influenced by clinically irrelevant sociodemographic attributes (education, insurance, housing, etc.) in adverse drug event prediction, exhibiting both explicit and implicit bias patterns.
Scaling Laws and Pathologies of Single-Layer PINNs: Network Width and PDE Nonlinearity: This work establishes empirical scaling laws for single-layer PINNs on representative nonlinear PDEs, identifying a dual optimization failure: a width-scaling pathology (error does not decrease with width) and a compound pathology (nonlinearity exacerbates this failure), demonstrating that optimization rather than approximation capacity is the primary bottleneck.
Securing the Language of Life: Inheritable Watermarks from DNA Language Models to Proteins: This paper proposes DNAMark and CentralMark, two watermarking schemes for embedding robust watermarks in sequences generated by DNA language models. DNAMark achieves function-preserving watermarks via synonymous codon substitution, while CentralMark realizes inheritable watermarks that propagate from DNA to protein through the central dogma.
Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation: This paper proposes DISCOVR, a self-supervised dual-branch framework that transfers fine-grained spatial semantics from an image encoder to the temporal representations of a video encoder via online semantic cluster distillation, achieving state-of-the-art performance across six cross-population cardiac ultrasound datasets on anomaly detection, classification, and segmentation tasks.
Self-Supervised Learning via Flow-Guided Neural Operator on Time-Series Data: This paper proposes FGNO (Flow-Guided Neural Operator), which combines Flow Matching with operator learning for self-supervised pre-training on time-series data. By leveraging STFT for resolution-invariant function-space learning and treating flow time and network layer depth as adjustable "knobs" for controlling feature granularity, FGNO substantially outperforms baselines such as MAE on biomedical tasks.
Self Iterative Label Refinement via Robust Unlabeled Learning: This paper proposes an iterative pipeline that leverages a robust unlabeled-unlabeled (UU) learning framework to refine LLM-generated pseudo-labels, surpassing the self-refinement approaches of GPT-4o and DeepSeek-R1 on both classification and generative safety alignment tasks with minimal human annotation.
Semantic and Visual Crop-Guided Diffusion Models for Heterogeneous Tissue Synthesis in Histopathology: This paper proposes HeteroTissue-Diffuse (HTD), a dual-conditioned Latent Diffusion Model that generates heterogeneous pathology images by simultaneously conditioning on semantic segmentation maps and real tissue crops (visual crops). On Camelyon16, the method reduces Fréchet Distance from 430 to 72 (a 6× improvement). A DeepLabv3+ segmentation model trained on the synthetic data reaches an IoU within 1–2% of one trained on real data. The approach is further extended to 11,765 unannotated TCGA whole-slide images via self-supervised clustering.
Sequential Attention-based Sampling for Histopathological Analysis: This paper proposes SASHA, a framework integrating a Hierarchical Attention-based Feature Distillation (HAFED) module with deep reinforcement learning (RL). By sampling only 10–20% of high-resolution patches, SASHA achieves classification performance on par with full-resolution SOTA methods, while yielding a 4–8× inference speedup and a WSI compression ratio exceeding 16×.
Shallow Robustness, Deep Vulnerabilities: Multi-Turn Evaluation of Medical LLMs: This paper proposes the MedQA-Followup framework to systematically evaluate the multi-turn robustness of medical LLMs. It reveals that models exhibit acceptable performance under single-turn perturbations (shallow robustness), yet accuracy can catastrophically drop from 91.2% to 13.5% under multi-turn follow-up challenges (deep vulnerability). Notably, indirect contextual manipulation proves more destructive than direct incorrect suggestions.
SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning: This paper presents SMMILE — the first expert-driven benchmark for multimodal medical in-context learning (ICL), comprising 111 questions (517 image-text QA triplets) spanning 6 medical specialties and 13 imaging modalities, constructed by 11 clinical experts. The benchmark systematically exposes critical deficiencies of current MLLMs in medical multimodal ICL and reveals the pivotal impact of in-context example quality and ordering on model performance.
SpecMER: Fast Protein Generation with K-mer Guided Speculative Decoding: SpecMER introduces speculative decoding into protein sequence generation, employing a K-mer-guided batch selection strategy to choose the candidate most consistent with evolutionary conservation from multiple draft model outputs for target model verification. It achieves 24–32% speedup while preserving distributional consistency, and the generated sequences demonstrate significantly improved NLL and pLDDT structural confidence scores compared to unguided baselines.
STAMP: Spatial-Temporal Adapter with Multi-Head Pooling: STAMP introduces a lightweight spatial-temporal adapter with only 750K parameters for Time Series Foundation Models (TSFMs). Through three sets of positional encodings (token/spatial/temporal), cross-gated MLP mixing, and multi-head attention pooling, it enables a frozen TSFM (e.g., MOMENT 385M) to compete with or surpass EEG-specific models with 29M parameters (CBraMod) across 8 EEG datasets, achieving 193% higher Kappa than CBraMod on BCIC-IV-2a.
STARC-9: A Large-scale Dataset for Multi-Class Tissue Classification for CRC Histopathology: This paper introduces STARC-9, a large-scale colorectal cancer (CRC) tissue classification dataset comprising 630K patches across 9 tissue classes, along with its construction framework DeepCluster++. The framework combines domain-specific autoencoder feature extraction, K-means clustering, and equal-frequency binning sampling to ensure morphological diversity. Models trained on STARC-9 significantly outperform those trained on NCT and HMU.
Steering Generative Models with Experimental Data for Protein Fitness Optimization: This work systematically evaluates strategies for steering protein generative models (discrete diffusion models and language models) toward fitness optimization, finding that plug-and-play guidance methods using small labeled datasets (~200 samples)—particularly DAPS—outperform RL-based fine-tuning, and proposes a Thompson sampling strategy incorporating predictive uncertainty for adaptive optimization.
Surf2CT: Cascaded 3D Flow Matching Models for Torso 3D CT Synthesis from Skin Surface: This paper proposes Surf2CT, a cascaded 3D Flow Matching framework that, for the first time, synthesizes complete high-resolution 3D CT volumes solely from external body surface scans and demographic data (age, sex, height, weight), without requiring any internal imaging input.
The Biased Oracle: Assessing LLMs' Understandability and Empathy in Medical Diagnoses: This work systematically evaluates GPT-4o and Claude-3.7 on readability and empathy in medical diagnostic communication. Both models produce reading levels well above recommended standards (grades 9–13 vs. the recommended grades 6–8). Affective empathy varies significantly with diagnosis type and patient education level, and LLM-as-Judge exhibits severe self-serving bias (GPT inflates its own empathy scores by ~0.3 points).
The Boundaries of Fair AI in Medical Image Prognosis: A Causal Perspective: FairTTE is the first comprehensive framework to systematically investigate fairness in time-to-event (TTE) prediction for medical imaging. It leverages causal analysis to quantify five sources of bias, and through training over 20,000 models, reveals the limitations of existing fairness methods — particularly the fundamental challenge of maintaining fairness under distribution shift.
THUNDER: Tile-level Histopathology image UNDERstanding benchmark: This paper presents THUNDER, a comprehensive tile-level benchmark for digital pathology foundation models, enabling efficient comparison of 23 foundation models across 16 datasets, covering downstream task performance, feature space analysis, robustness, and uncertainty estimation.
Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation: This work introduces ViMed-PET, the first Vietnamese PET/CT image-report dataset comprising 2,757 whole-body PET/CT volumes paired with complete clinical reports. Through a data augmentation strategy and a three-stage fine-tuning pipeline, the approach substantially improves VLM performance on medical report generation and VQA tasks. Novel evaluation metrics based on clinically critical information are also proposed.
Towards Multiscale Graph-based Protein Learning with Geometric Secondary Structural Motifs: This paper proposes SSHG (Secondary Structure-based Hierarchical Graph), a framework that constructs two-level hierarchical graph representations from protein secondary structure motifs — an intra-motif residue-level graph and an inter-motif global graph — and employs a two-stage GNN to learn local and global features respectively. Theoretical guarantees of maximal expressiveness are provided, with empirical improvements in both accuracy and computational efficiency on enzyme classification and ligand affinity prediction tasks.
Towards Self-Supervised Foundation Models for Critical Care Time Series: A self-supervised foundation model for critical care time series is constructed by pre-training a Biaxial Transformer (BAT) architecture on multiple ICU datasets, substantially outperforming supervised baselines in low-data regimes.
Towards Unified and Lossless Latent Space for 3D Molecular Latent Diffusion Modeling: This paper proposes UAE-3D, a multimodal variational autoencoder that compresses atomic types, chemical bonds, and 3D coordinates of molecules into a unified, near-lossless latent space. By eliminating the complexity of handling multimodality and equivariance, a general-purpose Diffusion Transformer achieves state-of-the-art 3D molecular generation.
Uncertainty-Aware Multi-Objective Reinforcement Learning-Guided Diffusion Models for 3D De Novo Molecular Design: This paper proposes an uncertainty-aware multi-objective reinforcement learning framework that guides a 3D molecular diffusion model (EDM) to simultaneously optimize drug-likeness (QED), synthetic accessibility (SAS), and binding affinity. The framework dynamically shapes the reward function using predictive uncertainty from surrogate models, consistently outperforms baselines across three benchmark datasets, and validates candidate molecules through molecular dynamics simulations and ADMET analysis.
Unified All-Atom Molecule Generation with Neural Fields: This paper proposes FuncBind, a framework that represents molecules as continuous atomic density functions via neural fields, constructing a unified conditional generative model capable of target-conditioned generation across three drug modalities: small molecules, macrocyclic peptides, and antibody CDR loops.
UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation: This paper proposes UniMRSeg, a unified missing-modality segmentation framework that employs a Hierarchical Self-Supervised Compensation (HSSC) mechanism—spanning input-level modality reconstruction, feature-level contrastive learning, and output-level consistency regularization—to achieve optimal average performance and minimal performance variance across all possible modality combinations using 100% shared parameters.
UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection: This work introduces UniSite-DS, the first UniProt (unique protein)-centric ligand binding site dataset, and UniSite, the first end-to-end binding site detection framework. UniSite directly predicts multiple potentially overlapping binding sites via set prediction loss and bijective matching, and further proposes IoU-based AP as a more accurate evaluation metric.
Unlearned but Not Forgotten: Data Extraction after Exact Unlearning in LLM: This paper reveals that even exact unlearning (retraining from scratch to remove data influence) is susceptible to privacy leakage. By exploiting the divergence between model checkpoints before and after unlearning, an adversary can apply reversed model guidance with token filtering to substantially improve extraction success rates for deleted data—in some settings doubling the extraction rate.
Unpaired Image-to-Image Translation for Segmentation and Signal Unmixing: This paper proposes Ui2i, a model built upon CycleGAN that achieves high content-fidelity unpaired image-to-image translation through four key innovations: a UNet-based generator, approximate bidirectional spectral normalization (ABSN) as a replacement for feature normalization, channel-spatial attention, and scale augmentation. The model is successfully applied to two biomedical tasks: IHC→H&E domain adaptation for nucleus segmentation and single-channel immunofluorescence signal unmixing.
Variational Autoencoder with Normalizing Flow for X-ray Spectral Fitting: This work embeds a Normalizing Flow (NF) into an autoencoder architecture to enable fast physical parameter inference and full posterior distribution estimation for NICER spectral data of black hole X-ray binaries, achieving approximately 2000× speedup over traditional MCMC methods while maintaining comparable accuracy.
VQ-Seg: Vector-Quantized Token Perturbation for Semi-Supervised Medical Image Segmentation: VQ-Seg is proposed as the first method to introduce vector quantization into semi-supervised medical image segmentation. A Quantization Perturbation Module (QPM) replaces conventional dropout to achieve more controllable feature perturbation, complemented by a dual-branch architecture and foundation-model-guided alignment to compensate for quantization information loss.
Why Masking Diffusion Works: Condition on the Jump Schedule for Improved Discrete Diffusion: This paper reveals the fundamental reason for the superiority of masking diffusion models — they implicitly condition on the known jump-time distribution — and proposes the Schedule-Conditioned Diffusion (SCUD) framework, which generalizes this advantage to arbitrary discrete diffusion models. Combined with structured forward processes, SCUD surpasses masking diffusion on both image and protein generation tasks.

📦 Model Compression

4DGCPro: Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming: This paper proposes 4DGCPro, a hierarchical 4D Gaussian compression framework that achieves multi-bitrate progressive volumetric video streaming within a single model, via perception-weighted hierarchical Gaussian representation, motion-aware adaptive grouping, and end-to-end entropy-optimized training. The framework supports real-time decoding and rendering on mobile devices and surpasses existing SOTA in rate-distortion performance.
A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings: This paper proposes A*-Thought, a CoT compression framework based on the A* search algorithm. It introduces Bidirectional Importance Scoring (BIS) to measure each reasoning step's relevance to both the question and the answer, and combines it with path-level A* search to efficiently identify the most compact valid reasoning path within an exponentially large search space. Under a 512-token budget, A*-Thought improves QwQ-32B accuracy by 2.39×; under a 4096-token budget, it reduces output tokens by approximately 50% with negligible accuracy loss.
A Granular Study of Safety Pretraining under Model Abliteration: This paper systematically investigates the effects of model abliteration, an inference-time activation-space editing attack, on various data-driven safety pretraining stages. It finds that safety mechanisms relying solely on refusal training are highly vulnerable, whereas combining multiple safety signals (safe-only filtering + rephrasing + metatags + refusals) distributes safety behavior across a broader representational space, making it substantially more resistant to single-direction projection removal.
A Partition Cover Approach for Tokenization: This paper reformulates tokenization as a partition cover optimization problem, proves it NP-hard, and proposes a polynomial-time greedy algorithm GreedTok that outperforms BPE in both compression rate and downstream tasks when pretraining a 1B-parameter LLM.
A Simple Linear Patch Revives Layer-Pruned Large Language Models: LinearPatch inserts a lightweight symmetric matrix — fusing a Hadamard transform with channel scaling — at the pruning interface to repair activation magnitude mismatches caused by layer pruning. On LLaMA-3-8B, it retains 94.15% of the original performance without any training, and reaches 95.16% after 30 minutes of distillation.
A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone: This paper proposes Low-Rank Clone (LRC), which compresses teacher weights into student weights via learnable low-rank projection matrices (soft pruning), while aligning intermediate activations of both attention and FFN modules (activation cloning). A 1.7B model trained on only 20B tokens surpasses Qwen3-1.7B trained on 36T tokens (64.98 vs. 63.17), achieving a 1,000× improvement in training efficiency.
Accurate and Efficient Low-Rank Model Merging in Core Space: This paper proposes the Core Space Merging framework, which performs model merging within a common reference basis space constructed from low-rank LoRA matrices. This approach losslessly reduces the merging operation from the full $m \times n$ space to a compact $Tr \times Tr$ space (where $T$ is the number of tasks and $r$ is the LoRA rank), achieving state-of-the-art merging accuracy on Llama 3 8B while reducing computational cost by several orders of magnitude.
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference: Existing KV cache eviction methods uniformly allocate budgets across all attention heads, ignoring the substantial variation in attention concentration across heads. This paper proposes Ada-KV — the first head-wise adaptive budget allocation strategy — which redistributes budget from sparse heads to dispersed heads. It provides a theoretical proof that the approach minimizes an upper bound on eviction loss, and serves as a plug-and-play improvement over existing methods across 29 datasets.
Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees: This paper proposes the R-AutoEval+ framework, which introduces an adaptive weighting mechanism within the testing-by-betting framework to dynamically regulate reliance on LLM-judge-generated synthetic data. It is the first method to simultaneously guarantee evaluation reliability and sampling efficiency no worse than approaches using only real data under finite samples, validated across three scenarios: LLM quantization, prompt selection, and inference budget allocation.
Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling: By theoretically analyzing the complementary weaknesses of ODE and SDE solvers (ODE solvers accumulate irreducible gradient errors; SDE solvers amplify discretization errors at large step sizes), this paper proposes AdaSDE—a method that introduces a learnable stochastic coefficient $\gamma_i$ at each denoising step to control noise injection intensity. Optimized via lightweight distillation, AdaSDE achieves state-of-the-art FID of 4.18 on CIFAR-10 and 8.05 on FFHQ at 5 NFE.
AdmTree: Compressing Lengthy Context with Adaptive Semantic Trees: This paper proposes AdmTree — an adaptive hierarchical context compression framework that constructs leaf gist tokens via information-density-driven dynamic segmentation, then aggregates them bottom-up into a binary semantic tree to achieve multi-granularity semantic preservation. It addresses two fundamental challenges: local detail loss in explicit methods and positional bias in implicit methods, outperforming the SOTA baseline Activation Beacon by over 10% on LongBench.
AI-Generated Video Detection via Perceptual Straightening: This paper proposes ReStraV, a method grounded in the perceptual straightening hypothesis—which posits that real videos form straighter trajectories in neural representation space—to detect AI-generated videos. Using temporal curvature and step-size statistics extracted from DINOv2 feature space, a lightweight classifier is trained to distinguish real from generated content, achieving 97.17% accuracy and 98.63% AUROC on VidProM with only ~48ms inference time.
All You Need is One: Capsule Prompt Tuning with a Single Vector: This paper proposes Capsule Prompt-Tuning (CaPT), identifying that existing task-aware soft prompts exhibit minimal interaction with input tokens — an "attention island" phenomenon. Incorporating instance-aware information into a single capsule prompt enables it to serve as an "attention anchor" that activates attention toward critical structural information, achieving superior performance over multi-prompt methods with extremely few parameters (e.g., only 0.003% of parameters on Llama3.2-1B).
ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data: ATLAS proposes a data generation framework based on a concept repository, expert iteration with knowledge distillation, and two novel augmentation strategies. It constructs a parallel corpus of 117K theorem statements, and achieves SOTA on all autoformalization benchmarks after fine-tuning Llama3.1-8B-Instruct.
AutoJudge: Judge Decoding Without Manual Annotation: AutoJudge automates the annotation of "critical tokens" in Judge Decoding. A semi-greedy search replaces mismatched tokens and checks whether the final answer changes, which labels each token's importance; a logistic regression classifier is then trained to predict that importance at inference time. This enables speculative decoding to accept 40+ tokens per round (vs. ~20 in standard methods), achieving a 1.5× speedup on GSM8K with less than 1% accuracy loss.
BaRISTA: Brain-Scale Informed Spatiotemporal Representation of Human Intracranial EEG: BaRISTA systematically investigates spatial encoding scales (electrode/parcel/lobe) for iEEG Transformers, finding that atlas parcel-level encoding combined with spatial masked reconstruction achieves 86.2% AUC on language task decoding (vs. PopT 79.5%). The choice of encoding scale has greater impact than masking strategy, and the model generalizes well across subjects.
Benford's Curse: Tracing Digit Bias to Numerical Hallucination in LLMs: This paper demonstrates that numerical hallucinations in LLMs originate from the Benford's Law-conforming digit frequency distribution in pretraining corpora—where digit 1 appears with ~30% probability while digit 9 appears with only ~5%—and that this bias is internalized by specific "digit-selective neurons" in the later FFN layers. A Digit Selectivity Coefficient (DSC) is proposed to localize biased neurons, and pruning 0.01% of neurons corrects 1.36–3.49% of erroneous predictions.
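As a quick check on the digit frequencies quoted in the Benford's Curse note, the classical first-digit form of Benford's law, $P(d) = \log_{10}(1 + 1/d)$, can be evaluated directly. This is a minimal sketch that assumes the leading-digit form of the law; the function name is illustrative and does not come from the paper.

```python
import math

def benford_first_digit_probs():
    """Leading-digit probabilities P(d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

probs = benford_first_digit_probs()
for digit, p in probs.items():
    print(f"digit {digit}: {p:.1%}")
```

Under this form, digit 1 receives roughly 30% of the mass and digit 9 just under 5%, in line with the figures quoted in the note.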
Beyond Higher Rank: Token-wise Input-Output Projections for Efficient Low-Rank Adaptation: TopLoRA analyzes the expressive capacity of LoRA from an input-output projection perspective, identifying that all tokens sharing a single projection matrix constitutes a critical bottleneck. It proposes dynamically adjusting LoRA weights via a learnable token-wise diagonal matrix $\Sigma_X$ (i.e., $\Delta W_X = B\Sigma_X A$), achieving fine-grained adaptation without increasing rank, and consistently outperforming LoRA by 2–3% across tasks.
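The token-wise update $\Delta W_X = B\Sigma_X A$ from the TopLoRA note can be sketched for a single token as follows. This is an illustration under stated assumptions: the diagonal entries `sigma_x` are passed in directly (how TopLoRA produces them from the token is not described in the note), and the helper names are hypothetical.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def token_wise_lora_delta(x, A, B, sigma_x):
    """Compute (B diag(sigma_x) A) x for one token x.

    A: r x n down-projection, B: m x r up-projection,
    sigma_x: length-r diagonal entries for this token (assumed given).
    """
    z = matvec(A, x)                              # project into the rank-r space
    z = [s * z_k for s, z_k in zip(sigma_x, z)]   # token-wise diagonal scaling
    return matvec(B, z)                           # project back to the output space

A = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]                  # r = 2, n = 3
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]   # m = 4, r = 2
x = [1.0, 2.0, 3.0]

plain_lora = token_wise_lora_delta(x, A, B, [1.0, 1.0])  # Sigma_X = I recovers B A x
scaled = token_wise_lora_delta(x, A, B, [0.5, 2.0])      # a different Sigma_X for this token
```

Setting every diagonal entry to 1 reduces the update to standard LoRA, which shows that the rank stays fixed while the effective projection changes per token.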
Beyond Random: Automatic Inner-Loop Optimization in Dataset Distillation: This paper proposes AT-BPTT (Adaptive Truncation BPTT), which partitions DNN training into early/middle/late stages and adaptively adjusts truncation strategies and window sizes accordingly. The method achieves average accuracy gains of 3–17% on CIFAR-10/100/Tiny-ImageNet/ImageNet-1K, while delivering 3.9× speedup and 63% memory reduction.
Bézier Splatting for Fast and Differentiable Vector Graphics Rendering: Bézier Splatting integrates the Gaussian Splatting framework with Bézier curves by uniformly sampling 2D Gaussian points along each curve and rendering via α-blending to achieve differentiable vector graphics. The method achieves 30× forward and 150× backward speedups over DiffVG while maintaining or surpassing the image quality of methods such as LIVE.
Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression: BQQ proposes quadratic binary quantization—representing weight matrices as products (rather than linear combinations) of binary matrices—thereby surpassing the expressive capacity of conventional first-order quantization. By extending AMFD (Annealed Mean-Field Descent) to PUBO problems for mixed-integer optimization, BQQ achieves a dramatic accuracy leap from 10.83% to 58.25% on 2-bit data-free ViT quantization.
BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks: BioBench is proposed as a unified benchmark spanning 9 ecological vision tasks, 4 taxonomic kingdoms, 6 image modalities, and 3.1 million images. It demonstrates that ImageNet top-1 accuracy explains only 34% of the variance across ecological tasks, and that approximately 30% of model rankings are incorrect among frontier models exceeding 75% accuracy.
C-LoRA: Contextual Low-Rank Adaptation for Uncertainty Estimation in Large Language Models: This paper proposes C-LoRA, which introduces a lightweight contextual module to condition the distribution of LoRA low-rank matrices on the input data, enabling sample-level heteroscedastic uncertainty estimation and significantly improving calibration quality in few-shot fine-tuning scenarios.
CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs: CAS-Spec constructs a multi-level draft model hierarchy from the target model itself via Dynamically Switchable Inference Acceleration (DSIA) strategies (e.g., layer sparsity at varying degrees), and employs the Dynamic Tree Cascade (DyTC) algorithm to adaptively route among draft models and allocate draft lengths based on online acceptance rates and latency predictions. The approach achieves lossless inference acceleration of 1.1×–2.3× in a fully training-free manner, with DyTC yielding gains of 47% and 48% over cascade and tree baselines, respectively.
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference: ChunkKV elevates the basic unit of KV cache compression from discrete tokens to semantic chunks (groups of contiguous tokens). By aggregating attention scores at the chunk level, it selects semantically intact segments for retention, and leverages the high cross-layer index similarity induced by chunking to enable layer-wise index reuse. At a 10% compression ratio, ChunkKV improves over SnapKV/PyramidKV by up to 8.7% and achieves a 26.5% throughput gain.
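The chunk-level selection described in the ChunkKV note can be sketched roughly as below. Fixed-size chunks and sum aggregation are assumptions of this sketch, and `select_chunks` is a hypothetical name; the paper's actual chunking and scoring may differ.

```python
def select_chunks(token_scores, chunk_size, keep_chunks):
    """Return indices of cached tokens kept after chunk-level selection.

    token_scores: one attention-derived importance score per cached token.
    Chunks are fixed-size contiguous groups (an assumption of this sketch).
    """
    chunks = [
        list(range(start, min(start + chunk_size, len(token_scores))))
        for start in range(0, len(token_scores), chunk_size)
    ]
    # Aggregate token scores per chunk (sum is an assumed aggregation).
    chunk_scores = [sum(token_scores[i] for i in chunk) for chunk in chunks]
    # Rank chunks by aggregated score and keep the highest-scoring ones.
    top = sorted(range(len(chunks)), key=lambda c: chunk_scores[c], reverse=True)[:keep_chunks]
    # Restore original order so each retained segment stays contiguous.
    return [i for c in sorted(top) for i in chunks[c]]

scores = [0.1, 0.9, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.8, 0.0, 0.0, 0.0]
kept = select_chunks(scores, chunk_size=3, keep_chunks=2)
```

In this toy input, the low-scoring neighbours of a high-scoring token are retained along with it, which is the intended contrast with token-level eviction.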
CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs: This paper proposes CodeGEMM, a codebook-centric GEMM kernel that precomputes inner products between centroids and activations and caches them as a Psumbook, replacing the conventional dequantization pipeline to achieve end-to-end speedups of 1.83× (8B) to 8.93× (70B) on 2-bit quantized LLMs.
Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers: This paper proposes REFORM, an inference framework that efficiently processes ultra-long contexts (up to millions of tokens) via a compress–gather–recompute three-stage pipeline. REFORM achieves improvements of 52% and 34% over the strongest baselines on RULER and BABILong respectively, while reducing inference time by 30% and peak memory usage by 5%.
Correlation Dimension of Auto-Regressive Large Language Models: This paper introduces the correlation dimension from fractal geometry into LLM analysis. By measuring the recursive structure among next-token log-probability vectors, it quantifies the hierarchical complexity of text, revealing a three-stage evolution of LLM pretraining, an indicator of hallucination tendency, and a unified detection capability for multiple text degeneration patterns — none of which can be captured by perplexity.
Data Efficient Adaptation in Large Language Models via Continuous Low-Rank Fine-Tuning: This paper proposes DEAL, a framework that leverages wavelet kernel feature filtering to preserve core historical knowledge in LoRA low-rank matrices, combined with a controlled knowledge update module and asymmetric regularization, enabling LLMs to acquire new knowledge without forgetting old tasks under few-shot continual fine-tuning.
DeltaFlow: An Efficient Multi-frame Scene Flow Estimation Method: This paper proposes DeltaFlow (ΔFlow), which extracts motion cues via inter-frame voxel differences (Δ scheme) to enable multi-frame scene flow estimation with feature sizes that remain constant regardless of the number of input frames. The method achieves state-of-the-art performance on Argoverse 2, Waymo, and nuScenes while running 2× faster than the second-best multi-frame approach.
Dense Backpropagation Improves Training for Sparse Mixture-of-Experts: This paper proposes Default MoE, a method that maintains exponential moving averages (EMA) of inactive expert outputs as surrogate signals, enabling dense gradient updates for the MoE router without significant computational overhead, thereby improving sparse MoE training performance.
Dependency Parsing is More Parameter-Efficient with Normalization: This paper identifies that the lack of normalization in biaffine scoring for dependency and semantic parsing leads to systematic overparameterization, and demonstrates that a simple $1/\sqrt{d}$ scaling can reduce BiLSTM parameters by up to 85% while matching or surpassing original performance.
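The $1/\sqrt{d}$ scaling mentioned in the dependency parsing note can be illustrated with a plain bilinear score. This is a minimal sketch assuming $d$ is the dimension of the vectors entering the biaffine scorer and that both vectors share that dimension; bias terms are omitted and the function name is hypothetical.

```python
import math

def biaffine_score(dep, W, head, normalize=True):
    """Score a (dependent, head) pair as dep^T W head, optionally scaled by 1/sqrt(d)."""
    d = len(dep)
    W_head = [sum(w * h for w, h in zip(row, head)) for row in W]
    raw = sum(x * y for x, y in zip(dep, W_head))
    return raw / math.sqrt(d) if normalize else raw

d = 4
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
dep = [1.0, 1.0, 1.0, 1.0]
head = [1.0, 1.0, 1.0, 1.0]

raw = biaffine_score(dep, identity, head, normalize=False)  # grows linearly with d here
scaled = biaffine_score(dep, identity, head)                # divided by sqrt(d)
```

The scaling plays the same role as in scaled dot-product attention: it keeps score magnitudes from growing with the hidden size.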
Deterministic Continuous Replacement: Fast and Stable Module Replacement in Pretrained Transformers: DCR mixes teacher and student module outputs via a deterministic annealing weight $\alpha(t)$, eliminating the gradient variance introduced by stochastic gating (e.g., BERT-of-Theseus), and achieves faster convergence and stronger feature alignment in cold-start module replacement scenarios.
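The deterministic mixing in the DCR note can be sketched as a convex combination of the two module outputs. The linear schedule and the convention that $\alpha(t)$ weights the student are assumptions of this sketch, not details confirmed by the note.

```python
def alpha(step, total_steps):
    """Deterministic annealing weight on the student (linear schedule assumed)."""
    return min(1.0, max(0.0, step / total_steps))

def dcr_mix(teacher_out, student_out, step, total_steps):
    """Blend teacher and student outputs with no random gating."""
    a = alpha(step, total_steps)
    return [(1.0 - a) * t + a * s for t, s in zip(teacher_out, student_out)]

teacher = [2.0, 4.0]
student = [0.0, 0.0]
start = dcr_mix(teacher, student, step=0, total_steps=10)    # pure teacher output
middle = dcr_mix(teacher, student, step=5, total_steps=10)   # equal blend
end = dcr_mix(teacher, student, step=10, total_steps=10)     # pure student output
```

Because the weight depends only on the training step, every batch sees the same blend, unlike stochastic gating where the module choice is sampled.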
Disentangling Latent Shifts of In-Context Learning with Weak Supervision: WILDA treats ICL as a weak supervision signal and encodes demonstration-induced latent shifts into lightweight LoRA adapters via a teacher-student framework, enabling efficient inference without repeated prompting. The student surpasses the teacher through pseudo-label correction and coverage extension, demonstrating weak-to-strong generalization.
Distillation Robustifies Unlearning: This paper reveals the core finding that "distillation can robustify unlearning" — distilling an unlearned model into a randomly initialized student network effectively discards latent capabilities. Building on this insight, the paper proposes UNDO (Unlearn-Noise-Distill-on-Outputs), which applies weight perturbation to the unlearned model prior to distillation, establishing a tunable compute–robustness trade-off that approaches the gold standard of retraining from scratch on both synthetic tasks and the WMDP benchmark.
DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment: DP-LLM identifies that per-layer quantization sensitivity varies dynamically across decoding steps, and proposes a dynamic layer-wise precision selection mechanism based on relative error. At runtime, each layer is assigned a precision (h-bit or l-bit) conditioned on the current input, achieving a better performance–latency trade-off than static mixed-precision methods.
DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning: DRAGON proposes a systematic LLM unlearning framework that requires no fine-tuning of the base model. It employs a two-layer detection module to identify prompts subject to unlearning, then uses a specially fine-tuned guard model to generate CoT reasoning instructions for in-context intervention, effectively removing private or harmful knowledge while preserving the model's general capabilities.
DuoGPT: Training-free Dual Sparsity through Activation-aware Pruning in LLMs: This paper proposes DuoGPT, a dual-sparse framework that reinterprets activation sparsity as dynamic structured weight sparsity and combines it with unstructured weight pruning. By extending the OBC framework with activation-aware calibration and a dense-model output residual correction term, DuoGPT achieves significant speedup and memory savings during the LLM decoding phase without any retraining.
Elastic ViTs from Pretrained Models without Retraining: SnapViT proposes a post-training structured pruning method that combines a local Hessian diagonal approximation derived from self-supervised gradients with global cross-module correlations estimated via evolutionary algorithms. Without any retraining or labels, it generates elastic ViT sub-networks spanning continuous sparsity levels in a single run, requiring less than 5 minutes on an A100 GPU.
FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic: FALQON eliminates the small-matrix quantization overhead introduced by standalone LoRA paths by directly melding LoRA adapters into FP8-quantized backbone weights. Combined with efficient gradient computation and a row-wise proxy update mechanism, it achieves approximately 3× training speedup over existing quantized LoRA methods.
FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing: FastLongSpeech compresses redundant speech representations via an iterative fusion strategy and transfers short-speech capabilities to long-speech scenarios through dynamic compression training. This enables large speech-language models (LSLMs) to process long speech efficiently without long-speech training data, achieving state-of-the-art performance on long-speech QA with a 70% improvement in inference efficiency.
Find your Needle: Small Object Image Retrieval via Multi-Object Attention Optimization: MaO proposes a novel approach for Small Object Image Retrieval (SoIR) that integrates multi-object pre-training with attention-based feature refinement, aggregating representations of multiple objects into a single global descriptor, achieving substantial improvements over existing retrieval methods across multiple benchmarks.
FiRA: Can We Achieve Full-Rank Training of LLMs Under Low-Rank Constraint?: This paper proposes Fira, the first LLM training framework that achieves full-rank training (full-rank gradients + full-rank weights) under low-rank constraints. By observing that the optimizer scaling factors in low-rank and full-rank training are highly similar, Fira approximates the correction of out-of-subspace gradients using low-rank scaling factors, and employs a norm-growth limiter to prevent loss spikes. Fira outperforms LoRA and GaLore in both pretraining and fine-tuning settings.
FirstAidQA: A Synthetic Dataset for First Aid and Emergency Response in Low-Connectivity Settings: This paper introduces FirstAidQA, a dataset of 5,500 synthetic first aid question-answer pairs generated by ChatGPT-4o-mini from certified first aid textbooks and validated by human experts, designed to support fine-tuning of first aid AI systems in low-connectivity or offline environments.
Gated Integration of Low-Rank Adaptation for Continual Learning of Large Language Models: This paper proposes GainLoRA, which introduces a gating module for each new task's LoRA branch in continual learning to generate adaptive integration coefficients. By enforcing orthogonal constraints, the new branch's output on old tasks is driven toward zero, effectively mitigating catastrophic forgetting.
Geometric Data Valuation via Leverage Scores: This paper proposes a geometric data valuation method based on statistical leverage scores as an efficient proxy for Data Shapley values. The proposed method satisfies the axioms of symmetry, efficiency, and dummy player, and extends to ridge leverage scores to address the dimensionality saturation problem, providing theoretical guarantees of $O(\varepsilon)$-approximate optimality.
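For reference, the standard textbook definitions behind the entry above can be written out as follows. The notation ($X$, $x_i$, $\ell_i$, $\lambda$) is chosen here for illustration and is not taken from the paper, and how the paper converts these scores into valuations is not stated in this note. For a data matrix $X \in \mathbb{R}^{n \times d}$ with rows $x_i^\top$:

$$
\ell_i = x_i^\top (X^\top X)^{\dagger} x_i, \qquad \sum_{i=1}^{n} \ell_i = \operatorname{rank}(X) \le d
$$

$$
\ell_i^{(\lambda)} = x_i^\top (X^\top X + \lambda I)^{-1} x_i, \qquad \lambda > 0
$$

Because the plain scores always sum to $\operatorname{rank}(X)$, their total cannot grow past $d$ however many points are added. Reading this as the "dimensionality saturation" that the ridge variant addresses is an assumption.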
Geometry of Decision Making in Language Models: By measuring the Intrinsic Dimension (ID) of hidden representations across layers in 28 open-source Transformer models at scale, this paper reveals a consistent "low–high–low" pattern: early layers operate on low-dimensional manifolds, middle layers expand the representational space, and later layers re-compress into low-dimensional representations aligned with decision-making.
Global Minimizers of ℓp-Regularized Objectives Yield the Sparsest ReLU Neural Networks: This paper proves that, for single-hidden-layer ReLU networks, global minimizers of the $\ell^p$ ($0 < p < 1$) path norm correspond exactly to the sparsest data-interpolating networks, thereby recasting the combinatorial sparse interpolation problem as a continuously differentiable optimization task.
GoRA: Gradient-Driven Adaptive Low Rank Adaptation: GoRA uses pre-computed gradient information to perform adaptive rank allocation and weight initialization together before training: it assigns per-layer ranks based on parameter sensitivity and initializes the $B$ matrix via the gradient pseudo-inverse so that the initial output approximates one step of gradient descent. This addresses both major bottlenecks of LoRA in a unified framework.
Graph Your Own Prompt: This paper proposes a Graph Consistency Regularization (GCR) framework that inserts parameter-free Graph Consistency Layers (GCL) at arbitrary network depths. GCL aligns the relational graph of intermediate features with a class-aware semantic graph derived from model predictions, promoting semantically consistent feature learning in a self-prompting manner—improving classification generalization without modifying the architecture or introducing additional parameters.
GraSS: Scalable Data Attribution with Gradient Sparsification and Sparse Projection: GraSS and FactGraSS are two-stage gradient compression algorithms that exploit the inherent sparsity of per-sample gradients to achieve sublinear time and space complexity ($O(k')$). They outperform the SOTA baseline LoGra by 165% in throughput on billion-parameter models while maintaining data attribution quality.
Graver: Generative Graph Vocabularies for Robust Graph Foundation Models Fine-tuning: This paper proposes Graver, a framework that decouples ego-graphs to extract transferable subgraph vocabularies, models their distributions via graphon experts, and routes relevant vocabularies to augment support samples through MoE-CoE, addressing the instability caused by structural mismatch in few-shot fine-tuning of graph foundation models (GFMs).
Hankel Singular Value Regularization for Highly Compressible State Space Models: By regularizing the Hankel singular value nuclear norm of SSM layers during training to encourage rapid decay, the trained model can be compressed to 10% of its original order via balanced truncation with negligible accuracy loss. A block-diagonal rotation matrix parameterization reduces Gramian computation from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$.
Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs: This paper identifies a previously overlooked local Key-Value asymmetry in LLM attention mechanisms—neighboring Keys exhibit homogeneity (similar attention weight distributions), while neighboring Values are heterogeneously distributed. Based on this observation, the paper proposes AsymKV, a training-free compression framework that merges Keys via homogeneity and represents Values losslessly through cardinality normalization, outperforming H2O by 5 points on LongBench.
Hyperbolic Dataset Distillation: This paper proposes HDD, the first method to incorporate hyperbolic space into dataset distillation. By matching the Riemannian centroids of real and synthetic data in the Lorentz hyperbolic space—rather than performing distribution matching in Euclidean space—HDD leverages the hierarchical weighting property of hyperbolic geometry to assign higher influence to more representative, low-level samples. The method consistently improves over DM/IDM baselines across multiple datasets.
Inference-Time Hyper-Scaling with KV Cache Compression: This paper proposes the Inference-Time Hyper-Scaling paradigm: by efficiently compressing the KV cache, more or longer parallel reasoning sequences can be generated under the same compute/memory budget, substantially improving the accuracy of reasoning models on tasks such as mathematics, code, and scientific reasoning.
KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments: This paper proposes KeyDiff — an attention-score-free KV cache eviction strategy that maintains the cache by retaining keys with the lowest average cosine similarity to other keys (i.e., geometrically most unique). Under strict memory constraints in block-wise inference settings, KeyDiff achieves ≤0.04% accuracy loss on LongBench with an 8K cache budget, while reducing end-to-end inference latency by up to 30%.
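The selection rule described above (keep the keys with the lowest average cosine similarity to the other keys) can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the paper's implementation: the helper names `cosine` and `select_keys_to_keep`, the toy 2-D keys, and the quadratic all-pairs loop are all hypothetical choices made for clarity.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def select_keys_to_keep(keys, budget):
    """Return sorted indices of the `budget` keys least similar, on average, to the rest."""
    n = len(keys)
    if budget <= 0:
        return []
    if budget >= n:
        return list(range(n))
    avg_sim = []
    for i in range(n):
        total = sum(cosine(keys[i], keys[j]) for j in range(n) if j != i)
        avg_sim.append(total / (n - 1))
    # Lowest average similarity = most distinctive direction, so keep those first.
    ranked = sorted(range(n), key=lambda i: avg_sim[i])
    return sorted(ranked[:budget])

# Toy cache (hypothetical data): three near-duplicate keys and one outlier.
keys = [
    [1.0, 0.0],
    [0.9, 0.1],
    [1.0, 0.05],
    [0.0, 1.0],
]
kept = select_keys_to_keep(keys, budget=2)
print(kept)
```

With a budget of two, the outlier key is retained together with one representative of the near-duplicate cluster. No attention scores are needed, which is the property the summary highlights.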
KINDLE: Knowledge-Guided Distillation for Prior-Free Gene Regulatory Network Inference: This paper proposes KINDLE, a three-stage framework that transfers gene regulatory knowledge learned by a prior-guided teacher model to a prior-free student model via knowledge distillation, achieving state-of-the-art performance in gene regulatory network (GRN) inference without relying on any external prior knowledge.
KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning: KTAE proposes a model-free token-level advantage estimation algorithm that quantifies the statistical association between each token and correct reasoning outcomes via Fisher's exact test and information gain. The resulting fine-grained token importance is superimposed on the rollout-level advantage of GRPO/DAPO, achieving superior performance on five mathematical reasoning benchmarks while significantly reducing generation length.
KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction: This paper proposes KVzip, a query-agnostic KV cache eviction method that quantifies the importance of each KV pair by leveraging the LLM itself to reconstruct the original context from the cached KV pairs. KVzip achieves 3–4× KV cache compression and approximately 2× reduction in FlashAttention decoding latency, while significantly outperforming existing query-aware methods in multi-query scenarios.
LayerIF: Estimating Layer Quality for Large Language Models using Influence Functions: LayerIF proposes using influence functions (IFs) to quantify the training quality of each layer in LLMs. By aggregating positive influence scores per layer, it derives a data-driven layer importance estimate, which is subsequently applied to two downstream tasks: LoRA-MoE expert allocation and layer-wise sparse pruning. The method achieves accuracy gains of 1.61% and 0.90% on Mistral-7B and Gemma-7B, respectively.
Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression: GLVQ learns a dedicated lattice codebook (defined by a learnable generator matrix) for each weight group of an LLM, combined with group-specific μ-law companding to handle heavy-tailed distributions. Under 2-bit quantization, it achieves a Wikitext-2 perplexity of 3.36 on Llama-2-70B, substantially outperforming QuIP# (3.91) and QTIP (3.78).
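To illustrate why companding helps with values concentrated near zero, here is a minimal sketch of classic μ-law companding followed by a uniform scalar quantizer. It is an assumption-laden toy: the fixed `mu=255.0`, the 4-level (2-bit) scalar grid, and the function names are illustrative, whereas the summary describes group-specific companding combined with learned lattice codebooks.

```python
import math

def mu_law_compress(x, mu=255.0):
    """Map x in [-1, 1] to [-1, 1], stretching the region around zero."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=255.0):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def quantize_uniform(y, levels):
    """Round y in [-1, 1] to the nearest of `levels` evenly spaced points."""
    step = 2.0 / (levels - 1)
    return round((y + 1.0) / step) * step - 1.0

x = 0.01  # a small-magnitude weight, typical of a peaked distribution

# Quantize directly on a uniform 2-bit grid.
direct = quantize_uniform(x, levels=4)

# Quantize in the companded domain, then map back.
companded = mu_law_expand(quantize_uniform(mu_law_compress(x), levels=4))

print(abs(direct - x), abs(companded - x))
```

For this small input, quantizing in the companded domain gives a much smaller reconstruction error than the plain uniform grid, because the grid points end up denser near zero after expansion.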
Learning to Better Search with Language Models via Guided Reinforced Self-Training: This paper proposes Guided-ReST, which progressively incorporates optimal solutions as subgoals into model-generated search trajectories to produce high-quality training data and distill more efficient search strategies. The approach yields substantial improvements in search efficiency and accuracy on Countdown and code self-repair tasks.
Learning to Factorize and Adapt: A Versatile Approach Toward Universal Spatio-Temporal Foundation Models: This paper proposes FactoST-v2, a factorized spatio-temporal foundation model framework that decouples universal temporal pre-training from domain-specific spatial adaptation, achieving cross-domain zero-shot/few-shot/full-shot spatio-temporal forecasting with linear complexity.
Less is More but Where: Dynamic Token Compression via LLM-Guided Keyframe Prior: This paper proposes DyToK, a training-free dynamic video token compression method that leverages query-conditioned keyframe priors inherent in the deep attention layers of VLLMs to adaptively allocate token budgets across frames, achieving plug-and-play optimal efficiency–accuracy trade-offs.
Linear Attention for Efficient Bidirectional Sequence Modeling: This paper proposes Lion, a framework that, for the first time, systematically extends linear Transformers to bidirectional sequence modeling. It unifies three equivalent representations—full linear attention, bidirectional RNN, and chunkwise parallel—achieving training speeds up to 10× faster than SSMs while matching softmax Transformer performance.
LittleBit: Ultra Low-Bit Quantization via Latent Factorization: This paper proposes LittleBit, a framework that achieves extreme LLM compression down to 0.1 BPW (bits per weight) via low-rank latent-space matrix factorization, binarization, and a multi-scale compensation mechanism. It compresses Llama2-13B to under 0.9 GB and substantially outperforms STBLLM in the sub-1-bit regime.
Loquetier: A Virtualized Multi-LoRA Framework for Unified LLM Fine-tuning and Serving: Loquetier is a framework that unifies the fine-tuning and inference of multiple LoRA adapters within a single runtime via a Virtualized Module and a Segmented Multi-LoRA Multiplication (SMLM) kernel, achieving a 3.0× throughput improvement for inference-only tasks and a 46.4× higher SLO attainment rate for unified tasks.
LT-Soups: Bridging Head and Tail Classes via Subsampled Model Soups: This paper proposes LT-Soups, a two-stage model merging framework that trains multiple models on subsampled datasets with progressively varying imbalance ratios and aggregates them via weight averaging, achieving balanced performance across head and tail classes over the full long-tail spectrum.
Matryoshka Pilot: Learning to Drive Black-Box LLMs with LLMs: This paper proposes Matryoshka Pilot (M-Pilot), which employs a lightweight white-box LLM as a controller to generate intermediate guidance (task decomposition, high-level plans, user profiles) for driving black-box LLMs on complex long-horizon tasks such as reasoning, planning, and personalization, with iterative DPO enabling continual self-improvement.
Memory-Efficient Training with In-Place FFT Implementation: This paper proposes rdFFT—the first truly in-place real-domain Fast Fourier Transform framework—which eliminates intermediate buffers via an implicit complex encoding scheme, achieving zero extra memory overhead for FFT/IFFT computation during training, with memory efficiency improvements exceeding 1500× in extreme cases.
Mingle: Mixture of Null-Space Gated Low-Rank Experts for Test-Time Continual Model Merging: This paper proposes a new paradigm called Test-Time Continual Model Merging (TTCMM) and the Mingle framework, which employs a low-rank mixture-of-experts architecture with an adaptive null-space constrained gating mechanism to dynamically merge models at test time using a small number of unlabeled samples. Mingle outperforms state-of-the-art methods by 7–9% across multiple benchmarks while reducing forgetting to near zero.
Mitigating Semantic Collapse in Partially Relevant Video Retrieval: To address semantic collapse in Partially Relevant Video Retrieval (PRVR), this paper proposes Text Correlation Preservation Learning (TCPL) and Cross-Branch Video Alignment (CBVA), which mitigate collapse phenomena in the text and video embedding spaces respectively, achieving substantial improvements in retrieval accuracy.
Mixture of Noise for Pre-Trained Model-Based Class-Incremental Learning: This paper proposes learning a beneficial "mixture of noise" to suppress parameter drift in pre-trained models during class-incremental learning. By dynamically mixing task-specific noise with learned weights across tasks, the method achieves state-of-the-art performance, particularly in the challenging 50-step incremental setting.
ModHiFi: Identifying High Fidelity Predictive Components for Model Modification: This paper proposes the Subset Fidelity metric and the ModHiFi framework. Through theoretical analysis, it proves that local reconstruction error linearly upper-bounds global prediction error for Lipschitz continuous networks. Without requiring training data, loss functions, or gradients—using only synthetic data—the framework identifies high-fidelity (HiFi) components within a model, and unifies the tasks of structured pruning and class unlearning under a single formulation.
Multi-Task Vehicle Routing Solver via Mixture of Specialized Experts under State-Decomposable MDP: This paper proposes the State-Decomposable MDP (SDMDP) framework, which reformulates multiple VRP variants as Cartesian products of base state spaces, and introduces the Mixture-of-Specialized-Experts Solver (MoSES), which leverages dedicated LoRA experts to enable latent space reuse of base policies, efficiently handling 16 VRP variants.
MUSTAFAR: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference: This paper proposes MUSTAFAR, a framework that systematically demonstrates the superiority of unstructured sparsity for KV cache pruning—achieving 70% sparsity on both Key and Value caches without accuracy degradation—and introduces a bitmap-based sparse format with a custom attention kernel, yielding up to 2.23× end-to-end inference throughput improvement.
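A bitmap-based sparse layout of the kind mentioned above can be sketched as follows: prune a row by magnitude, then store only the surviving values plus one bit per position. This is a generic illustration, not MUSTAFAR's actual format or kernel. The per-row magnitude criterion, the use of a Python integer as the bitmap, and the function names are assumptions.

```python
def magnitude_prune(row, sparsity):
    """Zero out the smallest-magnitude entries so that about `sparsity` of them are zero."""
    n_drop = round(len(row) * sparsity)
    drop = set(sorted(range(len(row)), key=lambda i: abs(row[i]))[:n_drop])
    return [0.0 if i in drop else v for i, v in enumerate(row)]

def bitmap_encode(row):
    """Split a row into (bitmap, packed nonzeros); bit i is set iff row[i] != 0."""
    bitmap = 0
    values = []
    for i, v in enumerate(row):
        if v != 0:
            bitmap |= 1 << i
            values.append(v)
    return bitmap, values

def bitmap_decode(bitmap, values, length):
    """Rebuild the dense row from the bitmap and the packed values."""
    it = iter(values)
    return [next(it) if (bitmap >> i) & 1 else 0.0 for i in range(length)]

# Hypothetical cache row with 10 entries, pruned to 70% sparsity.
row = [0.1, -2.0, 0.05, 0.7, -0.02, 1.5, 0.3, -0.01, 0.2, 0.9]
pruned = magnitude_prune(row, sparsity=0.7)
bitmap, values = bitmap_encode(pruned)
restored = bitmap_decode(bitmap, values, len(pruned))
print(bin(bitmap), values)
```

Unlike structured schemes, the surviving positions are unconstrained, so the bitmap is what lets a kernel recover where each packed value belongs.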
Navigating Simply, Aligning Deeply: Winning Solutions for Mouse vs. AI 2025: In the NeurIPS 2025 Mouse vs. AI competition, this paper presents the counterintuitive finding that a lightweight two-layer CNN substantially outperforms deep networks on visual robustness tasks, while demonstrating that a deeper ResNet architecture is more advantageous for neural alignment, revealing a fundamental tension between behavioral robustness and biological plausibility.
Offline Policy Evaluation of Multi-Turn LLM Health Coaching with Real Users: This paper conducts offline policy evaluation (OPE) on a deployed LLM health coaching system with real users. It finds that a uniformly high tool-use policy improves average reward but harms specific user subgroups. Through simulator experiments, the paper further validates that early information-gain exploration (curiosity reward) accelerates user profile identification and improves task success rates.
On the Creation of Narrow AI: Hierarchy and Nonlocality of Neural Network Skills: This paper investigates two fundamental challenges in creating narrow AI systems. First, hierarchical dependencies among tasks mean that certain narrow skills can be learned effectively only when training on broad distributions. Second, the nonlocality of skills makes it impossible to precisely separate desired from undesired capabilities via pruning. Even so, pruning followed by recovery fine-tuning still outperforms both distillation and training from scratch.
On the Hardness of Approximating Distributions with Tractable Probabilistic Models: This paper proves that approximating arbitrary distributions with tractable probabilistic models (e.g., decomposable probabilistic circuits) under bounded $f$-divergence is NP-hard, and establishes an exponential size separation between decomposable PCs and (deterministic + decomposable) PCs under approximate modeling, demonstrating that approximation relaxations do not alleviate the complexity bottlenecks inherent in exact modeling.
One-Step Diffusion-Based Image Compression with Semantic Distillation: This paper proposes OneDC—the first one-step diffusion-based generative image codec—which replaces text with the hyperprior as the semantic conditioning signal for the diffusion model and enhances its representational capacity via semantic distillation, achieving state-of-the-art perceptual quality with 39% bitrate savings and 20× decoding speedup over multi-step diffusion codecs.
Online Mixture of Experts: No-Regret Learning for Optimal Collective Decision-Making: This paper proposes an Online Mixture of Experts (OMoE) framework comprising two algorithms — UCB-Successive Elimination and Online Weighted Majority Voting — with theoretical no-regret guarantees, and applies them to the online dynamic aggregation of LLM experts.
Optimizing Distributional Geometry Alignment with Optimal Transport for Generative Dataset Distillation: This paper reformulates dataset distillation as an optimal transport (OT) distance minimization problem and achieves fine-grained distributional geometry alignment through a three-stage pipeline (OT-guided diffusion sampling, label-image alignment soft re-labeling, and OT logit matching), yielding at least 4% improvement over the previous state of the art on ImageNet-1K at IPC=10.
Order-Level Attention Similarity Across Language Models: A Latent Commonality: This paper proposes Order-Level Attention (OLA)—an order-wise decomposition of Attention Rollout—and discovers that different language models exhibit significant similarity in same-order OLA (OLAS). OLA is shown to implicitly encode syntactic knowledge, and based on this finding, the paper proposes TOA, the first training-free cross-LM adapter transfer method.
ParetoQ: Improving Scaling Laws in Extremely Low-bit LLM Quantization: This paper proposes ParetoQ — the first unified framework supporting 1/1.58/2/3/4-bit quantization — which systematically studies training strategies (full-precision pretraining vs. QAT budget allocation) and quantization function design (introducing the SEQ quantizer). The work demonstrates that 2-bit and 1.58-bit quantization outperform conventional 4-bit in the accuracy–model-size trade-off, and achieves state-of-the-art results across all bit-widths.
PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models: This paper proposes PermLLM, the first learnable channel permutation (LCP) framework for N:M sparse LLMs. By relaxing discrete permutation matrices into differentiable soft permutation matrices via Sinkhorn normalization, PermLLM enables end-to-end optimization. Combined with a block-level permutation strategy that substantially reduces computational overhead, the framework effectively improves the performance of N:M sparse LLMs.
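The Sinkhorn relaxation mentioned above can be illustrated with a small pure-Python sketch: exponentiate a score matrix, then alternate row and column normalization until the result is approximately doubly stochastic. The score values, the temperature `tau`, the iteration count, and the row-wise argmax used to read off a hard assignment are illustrative assumptions, not details from the paper.

```python
import math

def sinkhorn(scores, tau=1.0, n_iters=50):
    """Relax a square score matrix into an approximately doubly stochastic one."""
    n = len(scores)
    # Exponentiate with temperature tau. Subtracting each row's max avoids overflow
    # and leaves the result unchanged, because rows are normalized next.
    p = [[math.exp((v - max(row)) / tau) for v in row] for row in scores]
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        normalized = []
        for row in p:
            s = sum(row)
            normalized.append([v / s for v in row])
        p = normalized
        # Column step: make every column sum to 1.
        col = [sum(p[i][j] for i in range(n)) for j in range(n)]
        p = [[p[i][j] / col[j] for j in range(n)] for i in range(n)]
    return p

# Hypothetical learnable scores for permuting 3 channels.
scores = [
    [2.0, 0.1, 0.3],
    [0.2, 0.1, 1.5],
    [0.4, 1.8, 0.2],
]
soft = sinkhorn(scores, tau=1.0)

# Read off a hard assignment by taking the largest entry in each row.
hard = [max(range(len(row)), key=row.__getitem__) for row in soft]
print(hard)
```

Every step is differentiable in the scores, which is what makes gradient-based optimization possible. Lowering `tau` pushes the soft matrix closer to a hard permutation, though it may then need more iterations to balance rows and columns.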
PPG-Distill: Efficient Photoplethysmography Signals Analysis via Foundation Model Distillation: PPG-Distill proposes a knowledge distillation framework tailored for PPG signals. By combining prediction-level, feature-level, and patch-level (morphology + rhythm) distillation, it transfers knowledge from large PPG foundation models to lightweight student models, achieving up to 21.8% performance improvement alongside 7× inference speedup and 19× memory compression.
Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment: This paper derives optimal bit allocation for Gaussianized weights from an information-theoretic perspective, proposes the Q-Palette collection of fractional-bit quantizers and a mixed-scheme quantization framework, and achieves near-optimal quantization performance with inference acceleration in LLM deployment.
QSVD: Efficient Low-Rank Approximation for Unified Query-Key-Value Weight Compression: This paper proposes QSVD, which performs SVD on the joint QKV weight matrix and shares a single down-projection matrix across Q, K, and V to reduce KV cache size and computational overhead. Combined with importance-score-based adaptive rank allocation and a quantization scheme compatible with low-rank decomposition, QSVD achieves over 10% accuracy improvement on VLMs at lower hardware cost.
QuadEnhancer: Leveraging Quadratic Transformations to Enhance Deep Neural Networks: This paper proposes a lightweight quadratic enhancer (QuadEnhancer) that introduces sparsified quadratic interaction terms into each linear layer, achieving significant performance improvements over existing neural network architectures with negligible additional parameters and computational overhead.
Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization: This paper identifies a critical bottleneck in existing layer-wise PTQ methods—namely, their neglect of cross-layer accumulation and growth of quantization errors—and proposes the QEP framework, which explicitly corrects accumulated errors via error propagation and compensation, achieving substantial performance gains under extremely low-bit settings (INT2/INT3).
RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling: This paper proposes RAT (Recurrence And aTtention), a chunk-based intermediate architecture that models local dependencies within chunks via linear RNNs and enables global access across chunks via softmax attention. At a chunk size of $L=16$, RAT achieves a 9× single-layer decoding speedup and 10× maximum throughput improvement over standard attention with comparable performance; a hybrid variant alternating with sliding window attention achieves state-of-the-art results on nearly all benchmarks.
RCCDA: Adaptive Model Updates in the Presence of Concept Drift under a Constrained Resource Budget: This paper proposes RCCDA, a lightweight model update policy based on the Lyapunov drift-plus-penalty framework. Under concept drift scenarios where the data distribution shifts over time, RCCDA greedily determines when to retrain the model using only historical inference loss and a tunable threshold, while provably satisfying strict resource budget constraints.
Rectifying Soft-Label Entangled Bias in Long-Tailed Dataset Distillation: This paper identifies a dual entangled bias in soft labels within long-tailed dataset distillation — originating from both the distillation model and the distilled images — and proposes ADSA, an Adaptive Soft-label Alignment module that eliminates this bias via post-hoc calibration in logit space. As a plug-and-play module, ADSA integrates seamlessly into existing distillation pipelines, achieving up to 11.8% accuracy improvement on tail classes on ImageNet-1k-LT.
Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs: This paper proposes rLiVS (Recurrent LLM-informed Visual Selection), a training-free and model-agnostic method for streaming video understanding. It achieves state-of-the-art performance on streaming video benchmarks through three complementary designs: LLM attention-guided visual token selection (retaining only ~6% of tokens), recurrent reuse of historical tokens, and caption-based retrieval for question answering.
RefLoRA: Refactored Low-Rank Adaptation for Efficient Fine-Tuning of Large Models: RefLoRA selects the optimal low-rank factorization form at each iteration by minimizing an upper bound on the loss, thereby addressing the weight update inconsistency and imbalance caused by the non-uniqueness of the LoRA decomposition. It accelerates convergence and improves fine-tuning performance with negligible additional computational overhead.
Reject Only Critical Tokens: Pivot-Aware Speculative Decoding: PAD proposes a new speculative decoding paradigm based on utility matching rather than distribution matching. It trains a lightweight classifier to identify pivot tokens and rejects only those draft tokens that would degrade final output utility, achieving a 2.46× speedup on GSM8K with negligible accuracy loss.
REOrdering Patches Improves Vision Models: This paper reveals that patch ordering significantly affects the performance of long-sequence vision models, and proposes the REOrder framework, which leverages information-theoretic priors and reinforcement learning to automatically discover optimal patch permutations, achieving up to 3.01% improvement on ImageNet-1K and 13.35% on FMoW.
REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning: REP reduces training time by up to 51% and memory consumption by up to 41% for prompt-based rehearsal-free continual learning methods, with negligible accuracy loss, via three complementary techniques: fast prompt selection using a lightweight surrogate model, Adaptive Token Merging (AToM), and Adaptive Layer Dropping (ALD).
ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization: ReplaceMe is a training-free depth pruning method that uses a small calibration dataset to estimate a linear transformation approximating groups of pruned Transformer blocks. This transformation is fused into adjacent layer weights without introducing additional parameters, achieving 25% pruning on LLaMA-2-7B while retaining approximately 90% of original performance.
Representation Consistency for Accurate and Coherent LLM Answer Aggregation: This paper proposes Representation Consistency (RC), which improves answer aggregation by analyzing the consistency of internal activations when an LLM generates multiple candidate answers. Reasoning paths that yield the same answer with highly consistent internal representations are more likely to be correct. A sparse variant, RC-S, leveraging sparse autoencoders achieves the best performance, consistently outperforming Self-Consistency across 4 LLMs and 4 reasoning datasets.
Restoring Pruned Large Language Models via Lost Component Compensation: RestoreLCC proposes a targeted recovery strategy for pruned LLMs: it uses contrastive probing to identify critical attention heads, applies SVD decomposition to extract activation components lost during pruning, and injects them back into the pruned model as learnable bias vectors — significantly restoring performance without compromising sparsity or inference speed.
Revisiting Semi-Supervised Learning in the Era of Foundation Models: A systematic study reveals that conventional SSL methods offer limited benefit in the VFM era—PEFT on labeled data alone can match SSL—motivating V-PET: a simple and effective semi-supervised learning approach that ensembles pseudo-labels from multiple PEFT methods and multiple VFMs.
Robust Federated Finetuning of LLMs via Alternating Optimization of LoRA: This paper proposes RoLoRA, which alternately optimizes the down-projection ($\mathbf{A}$) and up-projection ($\mathbf{B}$) matrices of LoRA to address imprecise aggregation and limited expressiveness in federated learning. RoLoRA significantly outperforms FedAVG of LoRA and FFA-LoRA on RoBERTa-Large and Llama-2-7B.
Robustifying Learning-Augmented Caching Efficiently without Compromising 1-Consistency: This paper proposes Guard, a lightweight robustification framework that improves the robustness of a broad class of learning-augmented caching algorithms to $2H_{k-1}+2$ while preserving 1-consistency and incurring only $O(1)$ additional overhead per request.
S2M-Former: Spiking Symmetric Mixing Branchformer for Brain Auditory Attention Detection: This paper proposes S2M-Former, a spiking-driven symmetric mixing Branchformer framework that achieves SOTA-level accuracy on EEG-based auditory attention detection with only 0.06M parameters, via complementary learning across spatial-frequency dual branches and lightweight 1D token representations, while reducing energy consumption to 1/5.8 of dual-branch ANN counterparts.
Scalable Exploration via Ensemble++: This paper proposes Ensemble++, which achieves regret bounds comparable to exact Thompson Sampling using only $\Theta(d\log T)$ ensemble size via an incremental update mechanism over shared factor matrices, with natural extension to nonlinear/neural network settings.
Single-Teacher View Augmentation: Boosting Knowledge Distillation via Angular Diversity: This paper proposes Angular-KD, which attaches multiple lightweight linear branches to a single teacher model and introduces two angular diversity losses — a constrained inter-angle diversity loss and an intra-angle diversity loss — to generate diverse supervisory signals from a single teacher. This approach serves as a low-cost alternative to multi-teacher distillation and achieves state-of-the-art performance across multiple KD benchmarks.
Skrull: Towards Efficient Long Context Fine-tuning through Dynamic Data Scheduling: To address the training inefficiency caused by mixing long and short sequences in Long-context Supervised Fine-Tuning (Long-SFT), this paper proposes Skrull, a dynamic data scheduler consisting of two components — Distribution-Aware Context Parallelism (DACP) and Global Data Scheduling (GDS) — achieving an average 3.76× (up to 7.54×) training speedup in realistic Long-SFT scenarios.
Smooth Regularization for Efficient Video Recognition: This paper proposes a Gaussian Random Walk (GRW)-based smooth regularization technique that imposes temporal smoothness constraints (penalizing high-acceleration changes) on intermediate-layer embeddings of video recognition models, achieving 3.8%–6.4% accuracy improvements on lightweight models and establishing a new state of the art on Kinetics-600 under corresponding FLOP constraints.
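One simple way to penalize "high-acceleration changes" in a sequence of frame embeddings is a squared second-order finite difference. The sketch below shows only that generic penalty. It is an assumption about the flavor of the regularizer and does not reproduce the paper's Gaussian Random Walk formulation. The function name and toy trajectories are hypothetical.

```python
def acceleration_penalty(embeddings):
    """Mean squared second-order difference over a sequence of frame embeddings."""
    if len(embeddings) < 3:
        return 0.0
    total = 0.0
    for prev, cur, nxt in zip(embeddings, embeddings[1:], embeddings[2:]):
        # e[t-1] - 2*e[t] + e[t+1] is the discrete acceleration at frame t.
        total += sum((a - 2.0 * b + c) ** 2 for a, b, c in zip(prev, cur, nxt))
    return total / (len(embeddings) - 2)

# A trajectory moving at constant velocity incurs no penalty.
smooth = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
# A trajectory that flips direction every frame is penalized.
jittery = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]

print(acceleration_penalty(smooth), acceleration_penalty(jittery))
```

In training, a term like this would be added to the task loss with a small weight, so embeddings may still drift over time but are discouraged from changing direction abruptly between adjacent frames.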
Spark Transformer: Reactivating Sparsity in FFN and Attention: This paper proposes the Spark Transformer architecture, which simultaneously achieves high-level activation sparsity in both FFN and attention mechanisms (only 8% of neurons activated in FFN; each token attends to at most 256 tokens) via a Statistical Top-k operator. The approach achieves a 2.5× FLOPs reduction and up to 1.79× inference speedup while maintaining quality comparable to Gemma-2.
SpecAttn: Speculating Sparse Attention: SpecAttn proposes a training-free method that leverages attention weights already computed by the draft model in speculative decoding to predict important tokens for the verification model. Through KL divergence layer mapping, sorting-free top-p nucleus selection, and dynamic KV cache pruning, it achieves a 78.4% reduction in KV cache accesses with only a 15.29% increase in perplexity, significantly outperforming existing sparse attention methods.
Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models: This paper proposes a "specialization after generalization" framework that theoretically and empirically explains the effectiveness of test-time training (TTT) on in-distribution data under the Linear Representation Hypothesis (LRH). Foundation models are globally underparameterized, leading to concept superposition interference. TTT mitigates this by locally specializing the model—reallocating model capacity to the small subset of concepts relevant to the test task—thereby improving predictive performance without increasing model size.
Spiking Brain Compression: Post-Training Second-Order Compression for Spiking Neural Networks: This paper proposes Spiking Brain Compression (SBC), a second-order post-training one-shot compression framework based on the Van Rossum Distance, designed specifically for spiking neural networks (SNNs). By introducing a Surrogate Membrane Potential (SMP) Hessian, SBC enables efficient module-wise pruning and quantization, and for the first time compresses SEW-ResNet152 and Spike-Driven Transformer at the ImageNet scale.
Synergy between the Strong and the Weak: Spiking Neural Networks Are Inherently Superior in Temporal Processing: This paper identifies that SNNs can be naturally decomposed into multiple sub-models along the temporal dimension. By comparing output confidence across timestep sub-models to identify "strong" and "weak" instances, the paper proposes two self-distillation schemes — Strong2Weak and Weak2Strong — that significantly improve SNN performance without any external teacher model, achieving gains of up to 5.36% on neuromorphic datasets.
The Graphon Limit Hypothesis: Understanding Neural Network Pruning via Infinite Width Analysis: This paper proposes the "Graphon Limit Hypothesis": as network width tends to infinity, the binary mask sequences produced by different pruning methods converge, under the cut distance, to their respective unique graphon limits. Building on this foundation, the paper derives a Graphon NTK to analyze the training dynamics of sparse networks, providing a theoretical explanation for why different pruning methods yield markedly different performance at the same sparsity level.
The Structure of Relation Decoding Linear Operators in Large Language Models: This paper reveals that linear relation embeddings (LREs) in Transformer language models do not encode fine-grained relations but instead extract shared coarse-grained semantic attributes (e.g., "country," "gender"). A rank-3 tensor network is employed to compress large collections of relation decoding matrices by several orders of magnitude.
Tighter CMI-Based Generalization Bounds via Stochastic Projection and Quantization: By incorporating stochastic projection and lossy compression into the CMI (conditional mutual information) framework, this paper derives tighter generalization bounds, resolves the failure of classical CMI bounds on SCO counterexamples, and proves that memorization is not necessary for good generalization.
TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs: TokenSqueeze proposes a three-stage pipeline — adaptive reasoning depth selection, intra-step linguistic refinement (with KL divergence constraints), and length-aware preference optimization — achieving 50% token compression of reasoning chains without accuracy degradation, using only self-generated data.
Toward Efficient Inference Attacks: Shadow Model Sharing via Mixture-of-Experts: This paper proposes a Mixture-of-Experts (MoE)-based shadow model sharing framework that reduces the overall training cost of shadow models by sharing feature extraction layers across multiple inference attack tasks while training only lightweight task-specific expert modules, maintaining or improving attack performance.
Towards Effective Federated Graph Foundation Model via Mitigating Knowledge Entanglement: This work is the first to propose the Federated Graph Foundation Model (FedGFM) paradigm, which integrates the distributed collaborative capability of federated graph learning with the cross-domain generalization capability of graph foundation models. Two modules — AncDAI (Anchor-based Domain-Aware Initialization) and AdaDPP (Adaptive Domain-sensitive Prompt Pool) — are introduced to mitigate knowledge entanglement, achieving state-of-the-art performance on 8 cross-task, cross-domain datasets against 20 baselines.
Towards Unsupervised Open-Set Graph Domain Adaptation via Dual Reprogramming: This paper proposes GraphRTA, a framework that addresses the challenges of known-class classification and unknown-class detection in unsupervised open-set graph domain adaptation through two complementary mechanisms: model reprogramming (gradient-guided weight pruning) and graph reprogramming (target graph structure and feature optimization), without requiring manually specified thresholds.
Train with Perturbation, Infer after Merging: A Two-Stage Framework for Continual Learning: This paper proposes the Perturb-and-Merge (P&M) framework, which introduces model merging mechanisms into the continual learning paradigm. During training, random perturbations are added along the task vector direction to smooth the loss landscape; during inference, a closed-form optimal coefficient is used to compute a convex combination of the historical model and the current task model. Combined with LoRA, the framework achieves memory-efficient state-of-the-art continual learning performance.
Traversal Verification for Speculative Tree Decoding: This paper proposes Traversal Verification, a bottom-up verification algorithm that traverses from leaf nodes to the root. Rather than making acceptance/rejection decisions based on per-token probabilities, it considers the sequence-level probability of entire paths, thereby maximizing candidate utilization. The method is theoretically proven to be lossless and optimal on single chains, and consistently improves acceptance length by 2.2%–5.7% across diverse tree structures and tasks.
Twilight: Adaptive Attention Sparsity with Hierarchical Top-p Pruning: This paper proposes Twilight, which replaces fixed-budget top-k attention sparsity with an approach inspired by top-p (nucleus) sampling: it dynamically selects the smallest set of tokens whose cumulative attention weight reaches a threshold p, so the token budget adapts to the attention distribution of each head. Twilight achieves up to 1.4× additional speedup over state-of-the-art sparse attention methods while maintaining accuracy.
Understanding Differential Transformer Unchains Pretrained Self-Attentions: This paper conducts an in-depth analysis of the internal mechanism of the Differential Transformer, revealing that the differential operation is equivalent to a robust attention denoising process — it "unchains" pretrained self-attentions from the constraints of softmax normalization, enabling attention weights to be more freely allocated to genuinely important tokens.
Uni-LoRA: One Vector is All You Need: This paper proposes Uni-LoRA, a unified framework demonstrating that the parameter reduction strategies of various LoRA variants (Tied-LoRA, VeRA, VB-LoRA, etc.) are fundamentally distinguished by the choice of projection matrix mapping the full parameter space $\mathbb{R}^D$ to a low-dimensional subspace $\mathbb{R}^d$. An isometric random grouping projection matrix is designed such that training a single vector suffices to reconstruct all LoRA parameters of an LLM, achieving extreme parameter efficiency.
Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching: This paper proposes Approximate Likelihood Matching (ALM), a principled cross-tokenizer distillation method based on binarized f-divergence, which for the first time enables effective distillation and pure distillation across fundamentally different tokenizers (e.g., subword → byte-level).
VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models: VESSA proposes an unsupervised adaptation method for visual foundation models using short object-centric videos. Through a self-distillation framework combined with LoRA parameter-efficient fine-tuning and an uncertainty-weighted loss, it significantly improves downstream classification performance in target domains without requiring any labeled data.
Vision-centric Token Compression in Large Language Model: Vist proposes a vision-centric slow-fast dual-path token compression framework that renders distant long-context text as images and compresses them with a lightweight vision encoder, coupled with a Probability-guided Visual Enhancement (PVE) training objective. Across 11 ICL benchmarks, it achieves comparable accuracy with 2.3× fewer tokens, reducing FLOPs by 16% and memory by 50%.
VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models: VQToken introduces the first vector-quantization-based framework for extreme video token compression. By adaptively discretizing continuous ViT embeddings into a compact codebook and preserving spatiotemporal positional information via a token hash function, it achieves only 0.66% accuracy loss on NextQA-MC using merely 0.07% of the original tokens (approximately 13 tokens).
When Worse is Better: Navigating the Compression-Generation Trade-off in Visual Tokenization: This paper systematically investigates the trade-off between visual tokenizer compression rate and generation quality through scaling laws. It finds that more aggressive compression—despite yielding worse reconstruction—benefits generation for smaller models. The paper proposes Causally Regularized Tokenization (CRT), which embeds autoregressive inductive bias into Stage 1 training, achieving 2–3× computational efficiency gains. A 775M-parameter model with 256 tokens/image matches LlamaGen-3B's FID of 2.18.
zip2zip: Inference-Time Adaptive Tokenization via Online Compression: This paper proposes zip2zip, which deeply integrates the classical LZW online lossless compression algorithm into the LLM inference pipeline. During decoding, frequently co-occurring tokens are continuously merged into reusable "hypertokens" to dynamically expand the vocabulary. Combined with a dynamic embedding layer and training on compressed-space language modeling, zip2zip enables existing LLMs to acquire inference-time adaptive tokenization capability with only 10 GPU-hours of LoRA fine-tuning, achieving 15–40% reduction in input/output sequence length and up to 40% reduction in end-to-end decoding latency, with negligible downstream task performance degradation.
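The LZW-style merging mentioned in the zip2zip note can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `lzw_merge` and `expand`, the `max_merge_len` cap, and the choice to assign hypertoken IDs starting at `base_vocab_size` are all assumptions made for illustration. It shows only the classical LZW idea applied to a sequence of token IDs, where repeated token runs are replaced by newly created codebook entries.

```python
from typing import Dict, List, Tuple


def lzw_merge(
    tokens: List[int],
    base_vocab_size: int,
    max_merge_len: int = 4,  # hypothetical cap on hypertoken length
) -> Tuple[List[int], Dict[Tuple[int, ...], int]]:
    """Run one LZW-style pass over token IDs.

    Returns the shortened sequence and the codebook that maps each
    merged tuple of base tokens to its new hypertoken ID.
    """
    codebook: Dict[Tuple[int, ...], int] = {}
    next_id = base_vocab_size  # hypertoken IDs start after the base vocabulary
    output: List[int] = []
    current: Tuple[int, ...] = ()

    def emit(seq: Tuple[int, ...]) -> None:
        # A single base token is emitted as-is; longer runs use their code.
        output.append(seq[0] if len(seq) == 1 else codebook[seq])

    for tok in tokens:
        candidate = current + (tok,)
        if len(candidate) == 1 or candidate in codebook:
            # Keep extending the longest match seen so far.
            current = candidate
            continue
        emit(current)
        if len(candidate) <= max_merge_len:
            # Register the new run so later occurrences can be merged.
            codebook[candidate] = next_id
            next_id += 1
        current = (tok,)

    if current:
        emit(current)
    return output, codebook


def expand(merged: List[int], codebook: Dict[Tuple[int, ...], int]) -> List[int]:
    """Invert lzw_merge by expanding hypertokens back into base tokens."""
    inverse = {code: seq for seq, code in codebook.items()}
    restored: List[int] = []
    for tok in merged:
        restored.extend(inverse.get(tok, (tok,)))
    return restored


if __name__ == "__main__":
    ids = [1, 2, 1, 2, 1, 2, 1, 2]
    merged, book = lzw_merge(ids, base_vocab_size=10)
    print(len(ids), "->", len(merged))
    assert expand(merged, book) == ids
```

Because the codebook is built only from tokens already seen, a decoder can rebuild it on the fly, which is the property that makes LZW suitable for online use. How the resulting hypertokens are embedded and trained is specific to the paper and is not covered by this sketch.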

🧊 3D Vision

3D-Agent: Tri-Modal Multi-Agent Collaboration for Scalable 3D Object Annotation: This paper proposes Tri-MARF, a tri-modal multi-agent framework comprising a VLM annotation agent (multi-view, multi-candidate description generation), an information aggregation agent (BERT clustering + CLIP weighting + UCB1 Multi-Armed Bandit selection), and a point cloud gating agent (Uni3D text–point cloud alignment for hallucination filtering). The system achieves a CLIPScore of 88.7 (surpassing human annotation at 82.4), a throughput of 12k objects/hour, and has annotated approximately 2 million 3D models.
3D Visual Illusion Depth Estimation: This paper reveals that 3D visual illusions (e.g., wall paintings, screen replays, mirror reflections) severely mislead existing state-of-the-art monocular and stereo depth estimation methods. The authors construct a large-scale dataset comprising approximately 3k scenes and 200k images, and propose a VLM-driven monocular-stereo adaptive fusion framework that achieves state-of-the-art performance across diverse illusion scenarios.
Anti-Aliased 2D Gaussian Splatting: This paper proposes AA-2DGS, which addresses severe aliasing artifacts in 2D Gaussian Splatting under varying sampling rates through two complementary mechanisms: a world-space flat smoothing kernel and an object-space Mip filter. The method significantly improves multi-scale rendering quality while preserving the geometric accuracy advantages of 2DGS.
ARMesh: Autoregressive Mesh Generation via Next-Level-of-Detail Prediction: This paper proposes to formulate 3D mesh generation as a coarse-to-fine, next-level-of-detail prediction process. By reversing a generalized mesh simplification algorithm (GSlim), a progressive refinement sequence is obtained, which is then learned autoregressively via a Transformer. Generation begins from a single point and incrementally adds geometric and topological detail to produce a complete mesh.
AtlasGS: Atlanta-world Guided Surface Reconstruction with Implicit Structured Gaussians: AtlasGS is proposed to achieve smooth, high-frequency-detail-preserving surface reconstruction in indoor and urban scenes by incorporating the Atlanta-world structural prior into an implicit-structured Gaussian representation, comprehensively outperforming existing implicit and explicit methods.
BecomingLit: Relightable Gaussian Avatars with Hybrid Neural Shading: This paper proposes BecomingLit, a method that reconstructs high-fidelity, relightable, and real-time renderable head avatars from low-cost light stage multi-view sequences using 3D Gaussian primitives and hybrid neural shading (neural diffuse BRDF + analytic Cook-Torrance specular). A new publicly available OLAT facial dataset is also released.
CLIPGaussian: Universal and Multimodal Style Transfer Based on Gaussian Splatting: CLIPGaussian proposes the first unified style transfer framework based on Gaussian Splatting, supporting text- and image-guided stylization of 2D images, videos, 3D objects, and 4D dynamic scenes. It integrates as a plug-and-play module into existing GS pipelines without requiring large generative models or retraining from scratch, and without altering model size.
Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations: Concerto combines intra-modal 3D point cloud self-distillation with cross-modal 2D-3D joint embedding prediction. With this minimalist design, a single point cloud encoder (PTv3) learns emergent spatial representations that surpass both 2D/3D unimodal methods and their naive concatenation, achieving state-of-the-art performance on multiple 3D scene understanding benchmarks (ScanNet semantic segmentation: 80.7% mIoU).
Copresheaf Topological Neural Networks: A Generalized Deep Learning Framework: This paper proposes Copresheaf Topological Neural Networks (CTNNs), which leverage the algebraic-topological notion of copresheaves to define directional, heterogeneous message passing on combinatorial complexes. The framework unifies CNNs, GNNs, Transformers, Sheaf Neural Networks, and Topological Neural Networks as special cases, and surpasses conventional baselines on physics simulation, graph classification, and higher-order complex classification tasks.
CosmoBench: A Multiscale, Multiview, Multitask Cosmology Benchmark for Geometric Deep Learning: This paper introduces CosmoBench—the largest cosmological geometric deep learning benchmark to date—comprising 34,752 point clouds and 24,996 directed trees across multiple scales, viewpoints, and tasks. A key finding is that simple linear models sometimes outperform large GNNs.
Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation: Cue3D is the first model-agnostic framework for quantifying the importance of image cues in single-image 3D generation. By systematically perturbing six visual cues—illumination, texture, silhouette, perspective, edges, and local continuity—across seven methods spanning three paradigms (regression-based, multi-view, and native 3D generation), it reveals key insights: shape meaningfulness rather than texture governs generalization ability, illumination matters more than texture, and models are overly dependent on input silhouettes.
D$^2$USt3R: Enhancing 3D Reconstruction for Dynamic Scenes: This paper proposes the Static-Dynamic Aligned Pointmap (SDAP) representation, which unifies 3D alignment of static and dynamic regions into a single framework, enabling DUSt3R-based methods to achieve accurate dense 3D reconstruction and correspondence estimation in dynamic scenes.
DC4GS: Directional Consistency-Driven Adaptive Density Control for 3D Gaussian Splatting: This paper proposes DC4GS, an adaptive density control method based on Directional Consistency (DC), which improves primitive splitting decisions and split position selection in 3DGS by exploiting the angular coherence of positional gradients. DC4GS reduces the number of primitives by up to 30% while improving reconstruction quality.
DGH: Dynamic Gaussian Hair: This paper proposes Dynamic Gaussian Hair (DGH), a data-driven coarse-to-fine framework that learns hair dynamics via a volumetric implicit deformation model, and achieves photorealistic novel-view rendering of dynamic hair by combining cylindrical Gaussian representations with a curvature blending strategy.
DualFocus: Depth from Focus with Spatio-Focal Dual Variational Constraints: This paper proposes DualFocus, which achieves robust and accurate depth estimation from focal stacks via two complementary constraints: a spatial variational constraint (exploiting focus-dependent gradient patterns to distinguish depth edges from texture artifacts) and a focal variational constraint (enforcing a unimodal and monotonic focus probability distribution along the focal axis).
Dynamic Gaussian Splatting from Defocused and Motion-blurred Monocular Videos: A unified framework is proposed that jointly models defocus blur and motion blur via learnable blur kernel convolution, combined with a dynamic Gaussian densification strategy and unseen-view constraints, enabling high-quality novel view synthesis of dynamic scenes from blurry monocular videos using 3DGS.
DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation: DynaRend is proposed to jointly learn 3D geometry, semantics, and dynamics on triplane representations via differentiable volumetric rendering, using two complementary objectives — masked reconstruction and future prediction — enabling efficient transfer to downstream robotic manipulation tasks after pre-training.
E-MoFlow: Learning Egomotion and Optical Flow from Event Data via Implicit Regularization: This paper proposes E-MoFlow, which models optical flow as an implicit neural representation and egomotion as a continuous spline, jointly optimizing both via differential geometric constraints under an unsupervised paradigm to achieve 6-DoF egomotion and dense optical flow estimation from event data.
EA3D: Online Open-World 3D Object Extraction from Streaming Videos: This paper proposes EA3D (ExtractAnything3D), an online open-world 3D object extraction framework that performs simultaneous geometric reconstruction and comprehensive scene understanding from streaming videos via knowledge-integrated feature maps, online visual odometry, and recurrent joint optimization.
EAG3R: Event-Augmented 3D Geometry Estimation for Dynamic and Extreme-Lighting Scenes: EAG3R integrates asynchronous event streams from event cameras into the MonST3R point map reconstruction framework. Through a Retinex enhancement module, an SNR-aware fusion mechanism, and an event photometric consistency loss, it achieves robust depth estimation, pose tracking, and 4D reconstruction in extreme low-light dynamic scenes, significantly outperforming RGB-only methods via zero-shot transfer to nighttime scenarios.
EF-3DGS: Event-Aided Free-Trajectory 3D Gaussian Splatting: EF-3DGS is the first work to introduce event cameras into free-trajectory scene reconstruction. It employs an Event Generation Model (EGM) to reconstruct latent inter-frame images for continuous supervision, Contrast Maximization (CMax) combined with a Linear Event Generation Model (LEGM) to extract motion information for pose calibration, and a photometric BA + Fixed-GS strategy to resolve color inconsistency. The method achieves a 3 dB PSNR improvement and a 40% reduction in ATE in high-speed scenarios.
ELECTRA: A Cartesian Network for 3D Charge Density Prediction with Floating Orbitals: This paper proposes ELECTRA (Electronic Tensor Reconstruction Algorithm), an equivariant Cartesian tensor network that reconstructs electron density by predicting the positions, weights, and covariance matrices of floating Gaussian orbitals. On the QM9 benchmark, ELECTRA achieves 2.4× higher accuracy than the state-of-the-art method SCDP while being 4.4–11× faster at inference, and reduces the number of SCF iterations in DFT by 50.72%.
EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation: EnerVerse is a generative robotic foundation model that constructs a 4D embodied space via chunk-wise autoregressive video diffusion, sparse context memory, and multi-view generation priors. Combined with a 4DGS data flywheel to narrow the Sim2Real gap, it translates 4D world representations into physical actions through a policy head, achieving state-of-the-art performance on the LIBERO benchmark.
EUGens: Efficient, Unified, and General Dense Layers: EUGens introduces a new family of efficient dense layers that leverage Random Features to reduce the inference complexity of fully connected feedforward layers (FFLs) from quadratic to linear. The framework unifies existing efficient FFL extensions and achieves up to 27% inference speedup and 30% parameter compression across LLM pre-training, ViT image classification, and NeRF/iSDF 3D reconstruction tasks, while supporting layer-wise knowledge distillation without backpropagation.
Evaluation of Vision-LLMs in Surveillance Video: This paper proposes a training-free two-stage framework that leverages small Vision-LLMs to generate textual descriptions of video content, followed by an NLI classifier for zero-shot scoring. It systematically evaluates the impact of prompting strategies and privacy-preserving filters on anomalous behavior recognition in surveillance videos.
Every Camera Effect, Every Time, All at Once: 4D Gaussian Ray Tracing for Physics-based Camera Effect Data Generation: This paper proposes 4D Gaussian Ray Tracing (4D-GRT), which integrates 4D Gaussian Splatting with physics-based ray tracing. After reconstructing dynamic scenes from multi-view videos, the method renders physically accurate video data with controllable camera effects including fisheye distortion, depth of field blur, and rolling shutter artifacts.
Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation: Fin3R is proposed to improve the geometric accuracy and robustness of feed-forward 3D reconstruction models (DUSt3R/MASt3R/CUT3R/VGGT) in a unified and lightweight manner, by freezing the decoder and fine-tuning the encoder via monocular knowledge distillation with re-normalization LoRA adapters.
FlareX: A Physics-Informed Dataset for Lens Flare Removal via 2D Synthesis and 3D Rendering: This paper proposes the FlareX dataset, generated through three stages—parameterized template creation, illumination-law-guided 2D synthesis, and physics-engine-based 3D rendering—to produce physically realistic lens flare data. Models trained on FlareX significantly outperform those trained on all prior datasets on real-world test sets.
Flux4D: Flow-based Unsupervised 4D Reconstruction: Flux4D is proposed as an unsupervised and generalizable 4D dynamic driving scene reconstruction framework. It employs a feed-forward network to directly predict 3D Gaussians and their motion velocities, achieving large-scale scene reconstruction using only photometric loss and a static-preference regularization. The method surpasses all unsupervised approaches on PandaSet and Waymo while approaching the performance of supervised methods.
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes: This paper proposes Anywhere3D-Bench, the first 3D visual grounding benchmark spanning four levels—area, space, object, and part—revealing that even the strongest models (Gemini-2.5-Pro and o3) achieve only ~30% accuracy on space-level tasks and ~40% on part-level tasks, far below the human performance of 95%.
From Pixels to Views: Learning Angular-Aware and Physics-Consistent Representations for Light Field Microscopy: This paper proposes XLFM-Former, which learns angular–spatial priors of XLFM through view-level Masked View Modeling (MVM-LF) self-supervised pretraining, and introduces an Optical Rendering Consistency Loss (ORC Loss) based on PSF differentiable rendering to constrain the physical plausibility of the reconstructed volume. On the first standardized XLFM-Zebrafish benchmark constructed by the authors, the method achieves an average PSNR of 54.04 dB, surpassing the best baseline ConvNeXt (50.16 dB) by 7.7%.
From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries: This paper proposes FactoredScenes, which decomposes real-world 3D scene generation into a five-step factorization pipeline — learning a layout program library from synthetic data, generating scene programs via LLM, executing programs to obtain axis-aligned layouts, program-conditioned hierarchical pose prediction, and object retrieval and placement. The method achieves 38.3% FID improvement and 80.4% KID improvement on bedrooms, with human evaluators able to distinguish generated scenes from real ScanNet scenes only 67% of the time.
Fully Dynamic Algorithms for Chamfer Distance: This paper proposes the first fully dynamic algorithm for maintaining Chamfer distance, reducing the problem to approximate nearest neighbor (ANN) queries to achieve a $(1+\epsilon)$ approximation with update time $\tilde{O}(\epsilon^{-d})$, which is substantially cheaper than the linear time needed to recompute the distance from scratch after each update. On real-world datasets, the algorithm achieves <10% relative error while running orders of magnitude faster than naive approaches.
Galactification: Painting Galaxies onto Dark Matter Only Simulations Using a Transformer-Based Model: This paper proposes a multimodal Transformer encoder–decoder framework that takes density and velocity fields from inexpensive dark matter N-body simulations as input and autoregressively generates galaxy catalogs (positions + physical properties). The model faithfully reproduces hydrodynamical simulation results across multiple statistical metrics while achieving approximately 100× computational speedup.
GauDP: Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies: GauDP is proposed to enable scalable, perception-enhanced multi-agent collaborative imitation learning by constructing a globally consistent 3D Gaussian field from decentralized RGB observations of multiple agents and dynamically allocating Gaussian attributes back to each agent's local viewpoint.
Gaussian-Augmented Physics Simulation and System Identification with Complex Colliders: This paper proposes AS-DiffMPM, a differentiable Material Point Method (MPM) framework supporting arbitrary-shape rigid body colliders, combined with multiple novel-view synthesis methods to enable system identification of physical parameters from visual observations.
Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span: This paper proposes EgoSpanLift, a method that lifts egocentric 2D gaze predictions into 3D space, constructing multi-level volumetric visual span representations. Combined with a 3D U-Net and a causal Transformer, the framework forecasts future 3D regions of visual attention.
GeoComplete: Geometry-Aware Diffusion for Reference-Driven Image Completion: This paper proposes GeoComplete, which injects projected point clouds as geometric conditions into a dual-branch diffusion model and employs a target-aware masking strategy to achieve geometrically consistent reference-driven image completion, achieving a 17.1% improvement in PSNR.
GeoSVR: Taming Sparse Voxels for Geometrically Accurate Surface Reconstruction: This paper proposes GeoSVR, an explicit surface reconstruction framework based on sparse voxels. By introducing voxel-uncertainty depth constraints and sparse voxel surface regularization, GeoSVR comprehensively outperforms existing 3DGS- and SDF-based methods in geometric accuracy, detail preservation, and reconstruction completeness.
GOATex: Geometry & Occlusion-Aware Texturing: GOATex proposes the first occlusion-aware 3D mesh texturing framework. It decomposes meshes into visibility layers ordered from outermost to innermost via a ray-casting-based hit-level mechanism, applies a two-stage visibility control strategy combining normal flipping and residual face clustering, and performs visibility-weighted blending in UV space—achieving high-quality texture generation for both exterior surfaces and occluded interior surfaces.
HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene: HAIF-GS proposes a dynamic 3DGS framework built upon sparse motion anchors, achieving state-of-the-art rendering quality on the NeRF-DS and D-NeRF benchmarks via three key mechanisms: an anchor filter that separates dynamic and static regions, a self-supervised induced scene flow that guides temporally consistent deformation, and hierarchical anchor densification that captures fine-grained non-rigid motion.
High Resolution UDF Meshing via Iterative Networks: This paper proposes the first iterative meshing method for Unsigned Distance Fields (UDFs), which progressively propagates neighborhood information into local voxel pseudo-sign predictions through multiple forward passes. The approach effectively resolves surface holes and discontinuities caused by noisy neural UDFs at high resolutions, significantly outperforming existing single-pass methods across multiple datasets.
How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?: This paper systematically demonstrates that 90–95% of tokens in 3D point cloud Transformers (e.g., PTv3, Sonata) are redundant, and proposes gitmerge3D — a globally informed graph-based token merging method that achieves up to 5.3× FLOPs reduction and 6.4× memory savings with negligible accuracy loss, via an energy-score-driven adaptive merging strategy.
Hybrid Physical-Neural Simulator for Fast Cosmological Hydrodynamics: This paper proposes a hybrid physical-neural cosmological simulator that handles gravitational dynamics via a differentiable particle-mesh (PM) method and parameterizes the effective gas pressure field using a physics-constrained neural network. The model requires only a single reference simulation for training and outperforms the EGD baseline at both the field level and the statistics level.
HyPlaneHead: Rethinking Tri-plane-like Representations in Full-Head Image Synthesis: This paper systematically analyzes three fundamental problems of tri-plane-like representations in 3D-aware head synthesis — mirror artifacts, non-uniform mapping, and feature penetration — and proposes a hybrid hy-plane representation (planar + spherical) combined with a unify-split strategy and near-equal-area warping, achieving state-of-the-art performance in full-head image synthesis.
HyRF: Hybrid Radiance Fields for Memory-efficient and High-quality Novel View Synthesis: This paper proposes Hybrid Radiance Fields (HyRF), which combines compact explicit Gaussians (storing only 8 parameters each) with decoupled grid-based neural fields, achieving 20× model compression while attaining state-of-the-art rendering quality and real-time performance.
IBGS: Image-Based Gaussian Splatting: This paper proposes Image-Based Gaussian Splatting (IBGS), which enhances standard 3DGS rendering quality by learning color residuals from neighboring training images. The method significantly improves the modeling of high-frequency details and view-dependent effects without introducing additional storage overhead.
IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants: This paper presents IndEgo, the first large-scale multimodal egocentric vision dataset targeting real industrial environments. It comprises 3,460 egocentric video clips (~197 hours) and 1,092 exocentric recordings (~97 hours), spanning five major task categories (assembly/disassembly, logistics, maintenance, woodworking, and miscellaneous tasks) as well as collaborative work scenarios. Three benchmarks are established: mistake detection, reasoning-based QA, and collaborative task understanding.
Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks: This paper proposes a class of universal Stabilization Adapters that can be inserted into nearly any image model architecture. By freezing the base network and training only the adapter parameters, combined with a unified accuracy–stability–robustness loss function, the method endows frame-level models with video temporal consistency and corruption robustness.
Jasmine: Harnessing Diffusion Prior for Self-Supervised Depth Estimation: This paper is the first to incorporate the visual prior of Stable Diffusion into a self-supervised monocular depth estimation (SSMDE) framework. It proposes the Mix-Batch Image Reconstruction (MIR) proxy task to shield the SD prior from corruption by reprojection noise, and introduces the Scale-Shift GRU (SSG) to bridge the gap between SD's scale-shift-invariant (SSI) and self-supervised scale-invariant (SI) depth distributions. Jasmine achieves AbsRel = 0.090 on KITTI, establishing a new state of the art among all SSMDE methods, while comprehensively outperforming supervised SD methods such as Marigold, E2E FT, and Lotus in zero-shot generalization.
LangSplatV2: High-dimensional 3D Language Gaussian Splatting with 450+ FPS: By treating each 3D Gaussian as a sparse code over a global dictionary, LangSplatV2 replaces the heavyweight decoder with a sparse coefficient field, achieving 476.2 FPS high-dimensional feature splatting and 384.6 FPS 3D open-vocabulary querying — a 47× speedup over LangSplat.
Learning Efficient Fuse-and-Refine for Feed-Forward 3D Gaussian Splatting: This paper proposes a Fuse-and-Refine module that aggregates pixel-aligned Gaussian primitives into a coarse-to-fine voxel hierarchy via a hybrid Splat-Voxel representation. A sparse voxel Transformer fuses approximately 200K primitives within 15 ms, yielding ~2 dB PSNR improvement. The model is trained exclusively on static scenes yet generalizes zero-shot to streaming dynamic scene reconstruction.
Learning Neural Exposure Fields for View Synthesis: This paper proposes Neural Exposure Fields (NExF), which achieves 3D-consistent high-quality view synthesis by learning optimal exposure values per 3D point rather than per image. On HDR scenes, NExF surpasses the state of the art by more than 3.5 dB in PSNR while being 50× faster.
Linearly Constrained Diffusion Implicit Models: This paper proposes CDIM, a DDIM-based algorithm for solving linear inverse problems. By aligning the residual energy with the $\chi^2$ distribution of the forward diffusion process, CDIM adaptively controls the number and step size of projection steps, achieving inference speeds 10–50× faster than DPS while exactly satisfying measurement constraints in the noiseless case.
LinPrim: Linear Primitives for Differentiable Volumetric Rendering: This paper proposes LinPrim, which replaces 3D Gaussian kernels with linear primitives (octahedra and tetrahedra) as the scene representation for novel view synthesis. Through a differentiable rasterization pipeline, LinPrim enables end-to-end optimization and achieves reconstruction quality comparable to 3DGS on real-world datasets using fewer primitives, while maintaining real-time rendering capability.
Locality-Sensitive Hashing-Based Efficient Point Transformer for Charged Particle Reconstruction: By combining LSH with Point Transformer, the paper proposes HEPTv2 for end-to-end particle track reconstruction, eliminating the DBSCAN clustering post-processing bottleneck and achieving a 28.9× speedup while maintaining competitive tracking efficiency.
LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering: This paper proposes LODGE, which manages 3D Gaussian Splatting at multiple scales through a hierarchical Level-of-Detail (LOD) strategy. By dynamically selecting Gaussian representations of appropriate granularity based on camera distance, LODGE enables high-quality real-time rendering of large-scale scenes.
Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views: Look and Tell introduces a multimodal dataset that synchronously captures gaze, speech, and dual-view video from 25 participants in a kitchen environment using Meta Aria smart glasses and a fixed GoPro camera. Combined with 3D scene reconstruction and a multi-level annotation pipeline, it provides the first benchmark for studying referential communication across egocentric and exocentric perspectives.
MaNGO: Adaptable Graph Network Simulators via Meta-Learning: This paper proposes MaNGO (Meta Neural Graph Operator), which leverages meta-learning and conditional neural processes (CNP) to learn shared latent structure across simulation tasks under varying physical parameters, enabling rapid adaptation to new physical parameters without retraining.
MaterialRefGS: Reflective Gaussian Splatting with Multi-view Consistent Material Inference: MaterialRefGS achieves high-fidelity novel view synthesis and accurate illumination decomposition for reflective surfaces via multi-view consistent material inference constraints and a 2DGS ray-tracing-based environment modeling strategy.
Mesh-RFT: Enhancing Mesh Generation via Fine-Grained Reinforcement Fine-Tuning: This paper proposes Mesh-RFT, a framework that achieves face-level fine-grained mesh quality optimization through a topology-aware scoring system and Masked Direct Preference Optimization (M-DPO), significantly improving the geometric integrity and topological regularity of generated meshes.
Mesh Interpolation Graph Network for Dynamic and Spatially Irregular Global Weather Forecasting: This paper proposes MIGN, a framework that maps irregular weather station data onto a regular HEALPix mesh via a mesh interpolation strategy for message passing, and introduces parameterized spherical harmonics positional encoding to enhance spatial generalization, achieving significant improvements over existing methods on global weather forecasting tasks.
Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex: This paper proposes BraInCoRL (Brain In-Context Representation Learning), a Transformer-based meta-learning framework that predicts voxel-level neural responses for new subjects directly from a small number of stimulus–response samples via in-context learning (ICL), requiring no fine-tuning to generalize to new subjects or stimuli. With only 100 images as context, it approaches the performance of a reference model fully trained on 9,000 images.
MetaGS: A Meta-Learned Gaussian-Phong Model for Out-of-Distribution 3D Scene Relighting: MetaGS achieves high-quality 3D scene relighting under out-of-distribution (OOD) lighting conditions by embedding a differentiable Blinn-Phong reflectance model into 3D Gaussian splatting and adopting a bilevel meta-learning training strategy.
Metropolis-Hastings Sampling for 3D Gaussian Reconstruction: This paper proposes an adaptive Metropolis-Hastings framework to replace the heuristic density control mechanism in 3DGS. Through probabilistic sampling driven by multi-view photometric error, it achieves more efficient inference of Gaussian distributions and converges faster than 3DGS-MCMC.
More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models: Merge proposes a plug-and-play framework that inserts lightweight learnable Converters before each frozen pretrained T2I diffusion block, enabling depth estimation with only ~12% additional parameters while perfectly preserving the original image generation capability. It achieves state-of-the-art performance among unified models on multiple zero-shot depth estimation benchmarks.
Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding: Motion4D proposes a unified 4D Gaussian splatting framework that incorporates priors from 2D foundation models (semantic masks, point tracking, depth) into 3D representations via an iterative refinement strategy, achieving spatiotemporally consistent motion and semantic modeling. The method significantly outperforms existing approaches on video object segmentation, point tracking, and novel view synthesis tasks.
Motion Matters: Compact Gaussian Streaming for Free-Viewpoint Video Reconstruction: This paper proposes ComGS, a framework that exploits the locality and consistency of motion in dynamic scenes to drive the motion of all Gaussians in moving regions using only ~200 keypoints. ComGS achieves 159× storage compression over 3DGStream and 14× over QUEEN while maintaining competitive visual quality and rendering speed.
MPMAvatar: Learning 3D Gaussian Avatars with Accurate and Robust Physics-Based Dynamics: MPMAvatar integrates a Material Point Method (MPM) physics simulator with 3D Gaussian Splatting rendering. Through an anisotropic constitutive model and a novel collision handling algorithm for mesh-based colliders, it achieves accurate and robust physical animation of loose garments. On ActorsHQ and 4D-DRESS, it outperforms PhysAvatar across both geometry and appearance metrics, achieving a 100% simulation success rate vs. 37.6%, with a per-frame simulation time of only 1.1 seconds.
NerfBaselines: Consistent and Reproducible Evaluation of Novel View Synthesis Methods: This paper proposes NerfBaselines, an evaluation framework that addresses unfair comparisons in novel view synthesis (NVS) caused by inconsistent evaluation protocols. By wrapping original method code with a unified API and enforcing environment isolation, the framework ensures that each method's behavior exactly matches its original release. Experiments reveal that seemingly minor protocol differences—such as image resizing strategies and background colors—can significantly alter method rankings.
Neural Green's Functions: This paper proposes Neural Green's Functions, a learnable linear PDE solution operator based on eigendecomposition: pointwise geometric features are extracted from the domain geometry to predict the eigendecomposition of the Green's function, enabling one-time training to solve for arbitrary source functions and boundary conditions via numerical integration. On mechanical part thermal analysis, the method reduces error by 13.9% over the state-of-the-art neural operator while running 350× faster than numerical solvers.
Novel View Synthesis from A Few Glimpses via Test-Time Natural Video Completion: This paper reformulates sparse-input novel view synthesis as a test-time natural video completion problem. It leverages pretrained video diffusion models to generate intermediate pseudo-views, and iteratively optimizes 3D Gaussian Splatting (3D-GS) via an uncertainty-aware mechanism, achieving high-fidelity scene reconstruction under extremely sparse input conditions.
Object-Centric Representation Learning for Enhanced 3D Semantic Scene Graph Prediction: Through empirical analysis, this paper identifies object feature discriminability as the critical bottleneck in 3D scene graph predicate prediction (object misclassification accounts for 92%+ of predicate errors). It proposes an independently contrastively pre-trained object encoder (3D-2D-Text tri-modal alignment), a geometry-regularized relation encoder, and a bidirectional edge-gated GNN, achieving new SOTA on 3DSSG with Object R@1 59.53% and Predicate R@50 91.40%.
On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation: This paper proposes the Geometry Encoding Mixer (GEM), a geometry-aware PEFT module designed for 3D point cloud Transformers. It captures fine-grained local geometric details via a Spatial Adapter and injects global scene context via a Context Adapter, achieving performance on par with or exceeding full fine-tuning while updating only 1.6% of parameters.
Online Segment Any 3D Thing as Instance Tracking: AutoSeg3D reformulates online 3D instance segmentation as an instance tracking problem, leveraging long-term memory for cross-frame instance association, short-term memory for instance update, and spatial consistency learning to mitigate VFM over-segmentation. The method surpasses ESAM by 2.8 AP on ScanNet200 while maintaining real-time performance.
OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects: This paper proposes OnlineSplatter, a feed-forward online 3D reconstruction framework that requires no camera poses, depth priors, or global optimization. It achieves constant-time incremental reconstruction of free-moving objects via a dual-key memory module combining appearance-geometry latent keys and orientation keys.
OpenLex3D: A Tiered Evaluation Benchmark for Open-Vocabulary 3D Scene Representations: This paper proposes OpenLex3D, a tiered evaluation benchmark for open-vocabulary 3D scene representations. Built upon Replica, ScanNet++, and HM3D, it provides language annotations 13× richer than the original labels, supporting evaluation on two tasks: open-set 3D semantic segmentation and object retrieval.
Orientation-anchored Hyper-Gaussian for 4D Reconstruction from Casual Videos: This paper proposes OriGS (Orientation-anchored Gaussian Splatting), which achieves high-quality 4D dynamic scene reconstruction from casually captured monocular videos via a global orientation field and an orientation-aware hyper-Gaussian representation.
Orientation Matters: Making 3D Generative Models Orientation-Aligned: This paper introduces the task of orientation-aligned 3D object generation, constructs the Objaverse-OA dataset comprising 14,832 orientation-aligned 3D models across 1,008 categories, fine-tunes two mainstream 3D generation frameworks (Trellis and Wonder3D) to achieve orientation-aligned object generation, and demonstrates two downstream applications: zero-shot orientation estimation and arrow-guided rotation manipulation.
PhysX-3D: Physical-Grounded 3D Asset Generation: PhysX proposes the first end-to-end physical-property-driven 3D asset generation paradigm, comprising PhysXNet (the first 3D dataset with systematic annotations across five physical dimensions—absolute scale, material, functional affordance, kinematics, and functional description—covering 26K+ objects) and PhysXGen (a dual-branch feed-forward generation framework that injects physical knowledge into a pretrained 3D structural latent space).
Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers: This paper proposes Pixel-Perfect Depth, a monocular depth estimation model that performs diffusion generation directly in pixel space (rather than latent space). Through a Semantics-Prompted DiT (SP-DiT) that incorporates high-level semantic representations from visual foundation models and a cascaded DiT design, the model generates flying-pixel-free depth maps, surpassing all published generative models on five benchmarks.
Plana3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting: This paper proposes Plana3R, a feed-forward framework that requires neither camera poses nor planar annotations, predicting sparse 3D planar primitives and metric-scale relative poses from unpaired two-view images for zero-shot metric planar 3D reconstruction of indoor scenes.
PlanarGS: High-Fidelity Indoor 3D Gaussian Splatting Guided by Vision-Language Planar Priors: PlanarGS detects planar regions via a vision-language foundation model (GroundedSAM) with text prompts, combines multi-view depth priors from DUSt3R, and optimizes 3DGS through coplanarity constraints and geometric prior supervision to achieve high-fidelity surface reconstruction in indoor scenes.
PointMAC: Meta-Learned Adaptation for Robust Test-Time Point Cloud Completion: PointMAC is the first framework to introduce meta-auxiliary learning and test-time adaptation (TTA) into point cloud completion. It leverages Bi-Aux Units (random masked reconstruction + denoising) as self-supervised signals, employs MAML to align auxiliary objectives with the primary task, and at inference updates only the shared encoder for sample-level refinement, achieving state-of-the-art performance on synthetic, simulated, and real-world data.
Quantifying and Alleviating Co-Adaptation in Sparse-View 3D Gaussian Splatting: This paper identifies co-adaptation among Gaussians as the root cause of appearance artifacts in sparse-view 3D Gaussian Splatting, proposes the Co-Adaptation Score (CA) metric to quantify this entanglement, and introduces two plug-and-play regularization strategies—Gaussian Dropout and multiplicative opacity noise injection—that consistently reduce co-adaptation and improve novel view rendering quality across five baseline methods and three datasets.
Reconstruct, Inpaint, Test-Time Finetune: Dynamic Novel-View Synthesis from Monocular Videos: This paper proposes CogNVS, which decomposes dynamic scene novel-view synthesis into a three-stage pipeline — 3D reconstruction (recovering visible pixels) → video diffusion inpainting (generating occluded regions) → test-time finetuning (adapting to the target video distribution) — training the inpainting model with purely 2D video self-supervision to achieve zero-shot generalization to new test videos.
Reconstructing the Local Density Field with Combined Convolutional and Point Cloud Architecture: This paper proposes a hybrid neural network architecture combining convolutional networks (U-Net) and point cloud networks (DeepSets) to reconstruct the local dark matter density field from line-of-sight peculiar velocities of dark matter halos, achieving significant improvements over purely convolutional and linear reconstruction methods at small scales.
Rectified Point Flow: Generic Point Cloud Pose Estimation: This paper proposes Rectified Point Flow, a unified generative framework that reformulates pairwise point cloud registration and multi-part shape assembly as a conditional generation problem, estimating part poses by learning a continuous point-wise velocity field.
RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes: ROS-Cam proposes a camera parameter (focal length + pose) optimization method for dynamic scenes supervised solely by a single RGB video. It achieves state-of-the-art accuracy and the fastest runtime across 5 datasets via three key contributions: a patch-wise tracking filter for sparse, robust correspondences; a Cauchy distribution-based outlier-aware joint optimization that adaptively down-weights moving objects; and a two-stage optimization strategy grounded in Softplus/convex minimax analysis.
RigAnyFace: Scaling Neural Facial Mesh Auto-Rigging with Unlabeled Data: This paper proposes RigAnyFace (RAF), a scalable facial mesh auto-rigging framework that leverages a 2D supervision strategy to exploit unlabeled neutral meshes for training scale expansion, enabling high-quality FACS blendshape rigging across diverse topologies and disconnected components (e.g., eyeballs).
Robust Neural Rendering in the Wild with Asymmetric Dual 3D Gaussian Splatting: AsymGS leverages a key observation—that reconstruction artifacts caused by in-the-wild training data are stochastic in nature—and proposes an asymmetric dual 3DGS framework that suppresses artifacts via complementary masking strategies and consistency constraints. A Dynamic EMA Proxy is introduced for efficient training, achieving significant improvements over existing methods on multiple in-the-wild benchmarks.
ROGR: Relightable 3D Objects using Generative Relighting: This paper proposes ROGR, which leverages a multi-view diffusion relighting model to generate consistent images under multiple lighting conditions, trains a lighting-conditioned NeRF on the resulting dataset, and achieves feed-forward 3D object relighting under arbitrary environment lighting. ROGR attains state-of-the-art performance on the TensoIR and Stanford-ORB benchmarks while supporting interactive rendering.
Scaffold Diffusion: Sparse Multi-Category Voxel Structure Generation with Discrete Diffusion: This paper proposes Scaffold Diffusion, which treats sparse multi-category 3D voxels as token sequences and employs a Masked Diffusion Language Model (MDLM) with 3D sinusoidal positional encodings to generate spatially coherent multi-category voxel structures conditioned on occupancy maps. The method substantially outperforms autoregressive and conventional discrete diffusion baselines on an extremely sparse (>98% background) Minecraft house dataset.
Scalable Diffusion Transformer for Conditional 4D fMRI Synthesis: This paper proposes the first diffusion Transformer for voxel-level whole-brain 4D fMRI conditional generation, combining 3D VQ-GAN latent space compression, a CNN-Transformer hybrid backbone, and strong conditioning via AdaLN-Zero and cross-attention. The model achieves a task activation map correlation of 0.83, RSA of 0.98, and perfect condition specificity across seven cognitive tasks from the HCP dataset.
SceneForge: Enhancing 3D-text alignment with Structured Scene Compositions: This paper proposes SceneForge, a framework that composes individual 3D point cloud objects into multi-object scenes with explicit spatial relations, paired with LLM-refined compositional captions, to enhance data diversity and complexity for 3D-text contrastive learning, yielding consistent performance gains across multiple downstream tasks.
SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent: This paper proposes SceneWeaver, the first reflective agent framework for 3D scene synthesis, which unifies multiple scene generation paradigms through a standardized and extensible tool interface. By employing a reason-act-reflect closed loop for iterative refinement, it comprehensively outperforms existing methods in physical plausibility, visual realism, and semantic alignment.
Segment then Splat: Unified 3D Open-Vocabulary Segmentation via Gaussian Splatting: This paper proposes a novel "Segment-then-Splat" paradigm that assigns Gaussians to distinct object sets prior to 3D reconstruction, thereby eliminating geometric and semantic ambiguity and enabling unified 3D open-vocabulary segmentation for both static and dynamic scenes.
Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis: This paper proposes Shallow Flow Matching (SFM), which leverages weak generator outputs to construct intermediate states within a flow matching framework for coarse-to-fine TTS. Inference begins from these intermediate states rather than pure noise, simultaneously improving synthesis quality and accelerating inference.
SingRef6D: Monocular Novel Object Pose Estimation with a Single RGB Reference: SingRef6D is a lightweight 6D pose estimation pipeline requiring only a single RGB reference image. It fine-tunes Depth-Anything v2 via a token-scaler mechanism for robust depth prediction and introduces depth-aware matching to enhance LoFTR's spatial reasoning, substantially outperforming existing methods on transparent and reflective objects.
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation: This paper introduces the concept of Semantic Orientation, which describes object directions using natural language (e.g., the "insertion direction" of a USB plug or the "handle direction" of a cup). It constructs the large-scale OrienText300K dataset to train the PointSO model for zero-shot orientation prediction, and integrates these components into the SoFar system for 6-DoF scene understanding and robotic manipulation.
Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles: This paper proposes Styl3R, a feed-forward network that decouples 3D reconstruction from stylization via a structure–appearance dual-branch architecture, enabling stylized 3D reconstruction from uncalibrated sparse-view images and arbitrary style images in 0.15 seconds.
SyncHuman: Synchronizing 2D and 3D Generative Models for Single-View Human Reconstruction: SyncHuman is the first framework to unify 2D multi-view generative models and 3D native generative models within a single pipeline. Through pixel-aligned 2D-3D synchronized attention, the two branches mutually enhance each other, achieving high-fidelity textured mesh reconstruction under complex human poses and surpassing existing methods in both geometric accuracy and visual quality.
TAPIP3D: Tracking Any Point in Persistent 3D Geometry: This paper proposes TAPIP3D, which represents video as a camera-stabilized spatiotemporal 3D feature point cloud and iteratively refines multi-frame point trajectories in persistent 3D geometric space via a 3D Neighborhood-to-Neighborhood (N2N) attention mechanism, substantially outperforming existing 3D point tracking methods.
Temporal Smoothness-Aware Rate-Distortion Optimized 4D Gaussian Splatting: This paper proposes the first end-to-end rate-distortion (RD) optimized compression framework for 4D Gaussian Splatting. By exploiting the temporal smoothness prior of dynamic point trajectories via Haar wavelet transforms, the method achieves up to 91× compression over Ex4DGS (average model size ~1.1% of the original) while maintaining reasonable rendering quality and flexible rate-quality trade-off control.
Towards 3D Objectness Learning in an Open World: This paper proposes OP3Det, a class-agnostic open-world 3D detector that requires no text prompts. It leverages 2D foundation models for 3D object discovery and introduces a cross-modal Mixture-of-Experts (MoE) module to dynamically fuse point cloud and image features, substantially improving recall on novel object categories.
TP-MDDN: Task-Preferenced Multi-Demand-Driven Navigation with Autonomous Decision-Making: This paper proposes the Task-Preferenced Multi-Demand-Driven Navigation (TP-MDDN) benchmark and the AWMSystem autonomous decision-making framework, which achieves long-horizon multi-subtask navigation through three LLM modules—instruction decomposition, dynamic goal selection, and task status monitoring—coupled with a multi-dimensional cumulative semantic map.
TRIM: Scalable 3D Gaussian Diffusion Inference with Temporal and Spatial Trimming: This paper proposes TRIM (Trajectory Reduction and Instance Mask denoising), a post-training framework that accelerates 3D Gaussian diffusion model inference while improving generation quality through temporal trajectory pre-selection and spatial background token pruning. TRIM outperforms baselines such as DiffSplat on both T3Bench text-to-3D and GSO image-to-3D benchmarks.
U-CAN: Unsupervised Point Cloud Denoising with Consistency-Aware Noise2Noise Matching: This paper proposes U-CAN, an unsupervised point cloud denoising framework that infers multi-step denoising paths via a Noise2Noise matching scheme and geometric consistency constraints. The method approaches supervised performance and demonstrates that the consistency constraint generalizes to 2D image denoising.
UGM2N: An Unsupervised and Generalizable Mesh Movement Network via M-Uniform Loss: UGM2N is an unsupervised mesh movement network that achieves zero-shot generalization across PDE types and mesh geometries—without requiring pre-adapted mesh data—through a localized Node Patch representation and an M-Uniform loss function, while guaranteeing freedom from mesh tangling.
UMAMI: Unifying Masked Autoregressive Models and Deterministic Rendering for View Synthesis: This paper proposes UMAMI, a hybrid framework that unifies Masked Autoregressive Models (MAR) and deterministic rendering for sparse-view novel view synthesis. A bidirectional Transformer encodes multi-view image tokens and Plücker ray embeddings; two lightweight MLP heads handle visible regions (deterministic regression) and occluded regions (MAR diffusion generation) respectively. The rendering speed is an order of magnitude faster than fully generative baselines.
URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model: This paper proposes URDF-Anything, the first end-to-end articulated object reconstruction framework based on a 3D Multimodal Large Language Model (MLLM). By introducing a [SEG] token mechanism, the framework jointly predicts geometric segmentation and kinematic parameters, achieving state-of-the-art performance in segmentation accuracy (mIoU +17%), parameter error (−29%), and physical executability (surpassing baselines by 50%).
VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment: By introducing four view alignment strategies — edge-aware image supervision, visibility-aware multi-view photometric alignment, normal consistency constraints, and depth image feature alignment — VA-GS significantly improves the geometric representation accuracy of 3D Gaussian Splatting, achieving state-of-the-art performance in surface reconstruction and novel view synthesis.
VisualSync: Multi-Camera Synchronization via Cross-View Object Motion: VisualSync presents a multi-camera temporal synchronization framework grounded in epipolar geometry constraints. By leveraging pretrained vision foundation models (VGGT, CoTracker3, MASt3R) to extract motion trajectories and cross-view correspondences, the method estimates per-camera temporal offsets by minimizing Sampson error, achieving millisecond-level synchronization with median errors below 50 ms across four benchmarks.
Walking the Schrödinger Bridge: A Direct Trajectory for Text-to-3D Generation: This paper theoretically establishes SDS as a special case of the Schrödinger Bridge, and builds upon this insight to propose TraCe — a framework that constructs an explicit diffusion bridge between the current rendering and the text-conditioned target, learns the score dynamics along the bridge trajectory via LoRA fine-tuning, and achieves high-quality text-to-3D generation at low CFG values.
WildCAT3D: Appearance-Aware Multi-View Diffusion in the Wild: This paper proposes WildCAT3D, which extends the multi-view diffusion model CAT3D to learn scene-level novel view synthesis from in-the-wild internet data (e.g., tourist photographs) by explicitly modeling global appearance conditions, while simultaneously supporting appearance-controlled generation.
ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS: Grounded in the Information Bottleneck (IB) principle, this paper analyzes the capacity bottleneck of feed-forward 3DGS and proposes ZPressor, a lightweight, architecture-agnostic module that compresses multi-view inputs into a compact anchor-view representation, enabling existing models to scale to 100+ input views (480P, 80GB GPU) with consistent performance gains on DL3DV-10K and RealEstate10K.
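As a reading aid for the VisualSync entry above: the Sampson error it mentions is the standard first-order approximation of the geometric distance between a point correspondence and the epipolar constraint $x_2^\top F x_1 = 0$. The sketch below is a minimal NumPy version of that textbook formula, not the paper's implementation; the function name `sampson_error`, the array layout, and the toy fundamental matrix are assumptions made for illustration.

```python
import numpy as np


def sampson_error(F, x1, x2):
    """Sampson (first-order geometric) error for point correspondences.

    F  : (3, 3) fundamental matrix, with x2^T F x1 = 0 for exact matches.
    x1 : (N, 2) pixel coordinates in view 1.
    x2 : (N, 2) pixel coordinates in view 2.
    Returns an (N,) array of squared Sampson distances.
    """
    ones = np.ones((x1.shape[0], 1))
    p1 = np.hstack([x1, ones])  # homogeneous points in view 1, shape (N, 3)
    p2 = np.hstack([x2, ones])  # homogeneous points in view 2, shape (N, 3)

    Fp1 = p1 @ F.T   # row i is F @ p1[i]   (epipolar line in view 2)
    Ftp2 = p2 @ F    # row i is F.T @ p2[i] (epipolar line in view 1)

    # Squared algebraic residual of the epipolar constraint.
    numerator = np.sum(p2 * Fp1, axis=1) ** 2
    # Squared gradient norm of the residual w.r.t. the four pixel coordinates.
    denominator = (
        Fp1[:, 0] ** 2 + Fp1[:, 1] ** 2 + Ftp2[:, 0] ** 2 + Ftp2[:, 1] ** 2
    )
    return numerator / denominator


if __name__ == "__main__":
    # Toy setup (assumed): pure horizontal camera translation, for which
    # matching points must lie on the same image row.
    F_toy = np.array([[0.0, 0.0, 0.0],
                      [0.0, 0.0, -1.0],
                      [0.0, 1.0, 0.0]])
    pts1 = np.array([[10.0, 3.0], [4.0, 7.0]])
    pts2 = np.array([[12.0, 5.0], [9.0, 7.0]])
    print(sampson_error(F_toy, pts1, pts2))
```

In a synchronization setting, one would presumably evaluate such an error over candidate time offsets and keep the offset with the lowest value; that search is omitted here because the entry does not describe it.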

📐 Optimization & Theory

A Single-Loop First-Order Algorithm for Linearly Constrained Bilevel Optimization: For bilevel optimization problems with coupled linear constraints in the lower-level problem, this paper proposes SFLCB, a single-loop first-order algorithm that eliminates Hessian dependence via a penalty-based reformulation combined with an augmented Lagrangian, improving the iteration complexity from $O(\epsilon^{-3}\log(\epsilon^{-1}))$ to $O(\epsilon^{-3})$.
A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning: This paper proposes the first theoretical framework for sampling-based test-time scaling methods, decomposing reasoning error into estimation error and model error. It reveals the limitations of Self-Consistency (slow convergence) and Perplexity (large model error), and introduces the RPC method that combines the strengths of both, achieving comparable reasoning performance on 7 benchmarks with only 50% of the sampling cost.
A Unified Approach to Submodular Maximization Under Noise: This paper proposes a unified meta-algorithm framework that takes any exact submodular maximization algorithm satisfying a "robustness" condition as a black box and automatically converts it into an algorithm that maintains its approximation ratio (losing only $o(1)$) under a persistent noisy value oracle. This achieves, for the first time, optimal approximation ratios for non-monotone submodular maximization under matroid constraints and in the unconstrained setting.
A Unified Stability Analysis of SAM vs SGD: Role of Data Coherence and Emergence of Simplicity Bias: Through a linear stability analysis framework, this paper demonstrates that "flat minima ⇒ better generalization" and "SGD prefers simple functions" are two sides of the same coin — data coherence simultaneously governs both phenomena, and SAM amplifies the simplicity bias further by imposing stricter stability conditions.
Adaptive Algorithms with Sharp Convergence Rates for Stochastic Hierarchical Optimization: This paper proposes Ada-Minimax and Ada-BiO, two adaptive algorithms that combine momentum normalization with a novel online noise estimation strategy to achieve, for the first time, sharp convergence rates of $\tilde{O}(1/\sqrt{T} + \sqrt{\bar{\sigma}}/T^{1/4})$ for nonconvex-strongly-concave minimax and nonconvex-strongly-convex bilevel optimization without requiring prior knowledge of the gradient noise level.
An Adaptive Algorithm for Bilevel Optimization on Riemannian Manifolds: AdaRHD is the first adaptive algorithm for Riemannian bilevel optimization (RBO) that requires no prior knowledge of problem-specific parameters (strong convexity constants, Lipschitz bounds, manifold curvature). By adopting an inverse cumulative gradient norm strategy for adaptive step size selection and solving the lower-level problem, linear system, and upper-level update sequentially within a three-stage framework, AdaRHD achieves a convergence rate of $O(1/\epsilon)$ matching non-adaptive methods, while exhibiting substantially greater robustness to initial step size choices compared to RHGD.
Asymptotically Stable Quaternionic Hopfield Structured Neural Network with Supervised Projection-based Manifold Learning: This paper proposes a Quaternion-valued Supervised Hopfield-structured Neural Network (QSHNN) that employs a periodic projection strategy to maintain the quaternionic structural consistency of the weight matrix. The existence and uniqueness of fixed points and their asymptotic stability are established via Lyapunov theory, while bounded trajectory curvature guarantees path smoothness for robotic path planning.
Auto-Compressing Networks: Auto-Compressing Networks (ACN) replace short residual connections with long-range forward connections (aggregating all layer outputs directly into the final output), making the Direct Gradient (DG) component significantly stronger than the Forward Gradient (FG), thereby implicitly compressing information into earlier layers. A ViT with only 6 layers matches standard 12-layer performance; BERT saves 75% of its layers; additional benefits include noise robustness (+6.4%) and reduced catastrophic forgetting in continual learning (−18%).
Automated Algorithm Design via Nevanlinna-Pick Interpolation: This paper proposes an automated algorithm design framework based on Nevanlinna-Pick interpolation from frequency-domain robust control theory, targeting strongly convex optimization with equality constraints, and achieves an optimal trade-off between the number of matrix-vector multiplications and the convergence rate.
AutoOpt: A Dataset and a Unified Framework for Automating Optimization Problem Solving: AutoOpt introduces the first end-to-end framework for converting optimization problem images to executable code — comprising the AutoOpt-11k dataset of 11,554 optimization formula images (handwritten + printed), an M1 hybrid encoder (ResNet+Swin→mBART) for image-to-LaTeX conversion (BLEU 96.70), an M2 DeepSeek-Coder module for LaTeX-to-PYOMO translation, and an M3 bilevel decomposition solver, achieving an overall pipeline success rate of 94.20%.
Better NTK Conditioning: A Free Lunch from ReLU Nonlinear Activation in Wide Neural Networks: This paper establishes a previously unnoticed "free" benefit of ReLU activation in wide neural networks: (a) it induces better data separation in the model's gradient feature space (angles between similar inputs are amplified in gradient space), and (b) this strictly reduces the condition number of the NTK matrix compared to linear networks. Depth further amplifies this effect — in the infinite-width-then-infinite-depth limit, all data pairs achieve equal angular separation in gradient space (~75.5°), and the NTK condition number converges to a fixed value $(n+4)/3$ that depends only on the number of training samples $n$.
Beyond Õ(√T) Constraint Violation for Online Convex Optimization with Adversarial Constraints: This paper studies Constrained Online Convex Optimization (COCO) with adversarial constraints and introduces a tunable parameter $\beta$ that trades off $\tilde{O}(T^\beta)$ regret against $\tilde{O}(T^{1-\beta})$ cumulative constraint violation (CCV), improving on the previously best-known $\tilde{O}(\sqrt{T})$ bound on constraint violation.
Brain-like Variational Inference: This paper proposes the FOND framework (Free energy Online Natural-gradient Dynamics), which derives spiking neural network inference dynamics from first principles via free energy minimization, and implements iPVAE (iterative Poisson VAE). iPVAE outperforms standard VAEs and predictive coding models in reconstruction–sparsity trade-off, biological plausibility, and OOD generalization.
Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment: This paper introduces PrefCleanBench, the first comprehensive benchmark for systematically evaluating 13 preference data cleaning methods in the context of LLM alignment. It covers diverse datasets, model architectures, and optimization algorithms, revealing the underappreciated yet critical role of data preprocessing in responsible AI development.
Composing Global Solutions to Reasoning Tasks via Algebraic Objects in Neural Nets: This paper proposes the CoGS framework, demonstrating that the weight space of two-layer quadratic-activation networks on Abelian group multiplication reasoning tasks admits a semiring algebraic structure. The Sum Potentials in the loss function are ring homomorphisms, enabling global optimal solutions to be algebraically composed from partial solutions—each satisfying only a subset of loss constraints—via ring addition and multiplication. Approximately 95% of gradient descent solutions match the theoretical constructions exactly.
Composing Global Solutions to Reasoning Tasks via Algebraic Objects in Neural Nets: This work reveals that the weight space of two-layer quadratic-activation networks trained on Abelian group reasoning tasks possesses a semiring algebraic structure, and proposes the CoGS framework that composes partial solutions into globally optimal solutions via ring operations. Approximately 95% of gradient descent solutions match the theoretical constructions exactly.
Constrained Network Slice Assignment via Large Language Models: This paper investigates the use of LLMs (Claude series) for solving constrained optimization problems in 5G network slice resource allocation. Two approaches are proposed: zero-shot LLM direct assignment and LLM-guided integer programming. Empirical findings show that LLMs alone can produce reasonable initial allocations but may violate hard constraints, whereas combining LLMs with an ILP solver achieves 100% completeness and balanced utilization.
Contribution of Task-Irrelevant Stimuli to Drift of Neural Representations: This work theoretically demonstrates that the statistical properties (variance and dimensionality) of task-irrelevant stimuli are key drivers of representational drift in online learning. Across Oja's rule, Similarity Matching, autoencoders, and supervised two-layer networks, a drift rate of $D \propto \lambda_\perp^2 (n-m)$ is consistently observed. Furthermore, learning-noise-induced drift exhibits anisotropic geometric structure, qualitatively distinct from the isotropic drift induced by Gaussian synaptic noise.
Covariances for Free: Exploiting Mean Distributions for Training-free Federated Learning: This paper proposes FedCOF, which leverages only the class means uploaded by clients to perform unbiased estimation of class covariance matrices on the server side, enabling initialization of a global classifier with zero training and minimal communication overhead — achieving performance on par with or surpassing Fed3R, which requires transmission of second-order statistics.
DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization: DartQuant proposes a distribution-calibration-based rotation matrix optimization method that drives activation distributions toward uniformity via a Whip loss to reduce quantization error, and replaces expensive manifold optimizers with QR-Orth, achieving 47× speedup and 10× memory reduction on 70B models — enabling large-model rotation calibration on a single RTX 3090 for the first time.
Deep Taxonomic Networks for Unsupervised Hierarchical Prototype Discovery: Deep Taxonomic Networks proposes a deep latent variable model with a complete binary tree Gaussian mixture prior. Through variational inference, the model automatically discovers hierarchical taxonomies and multi-level prototype clusters from unlabeled data without requiring a predefined number of classes, substantially outperforming baselines such as TreeVAE across multiple datasets.
Do Neural Networks Need Gradient Descent to Generalize? A Theoretical Study: Working within the matrix factorization framework (a canonical theoretical testbed for neural networks), this paper studies Guess & Check (G&C), which randomly samples parameters until the training set is fit. It shows that G&C generalization degrades with increasing width, giving the first provable demonstration of a canonical setting where G&C is strictly inferior to gradient descent, yet improves with increasing depth. Width and depth therefore play opposite roles in generalization.
Doubly Robust Alignment for Large Language Models: DRPO draws on doubly robust estimation from causal inference to propose a preference optimization algorithm that maintains consistency whenever either the preference model or the reference policy is correctly specified, outperforming PPO/DPO and their variants both theoretically and empirically.
DynaAct: Large Language Model Reasoning with Dynamic Action Spaces: DynaAct frames action space construction in LLM reasoning as a subset selection problem, dynamically assembling a compact action space at each step via a submodular function that balances utility and diversity. It outperforms rStar, RAP, and other baselines on 6 benchmarks, surpassing rStar by 6.8% on MATH-500.
Effective Policy Learning for Multi-Agent Online Coordination Beyond Submodular Objectives: This paper proposes two multi-agent online coordination algorithms, MA-SPL and MA-MPL, which leverage a policy-based continuous extension technique to surpass the limitations of submodularity. For the first time, both algorithms achieve the optimal $(1 - c/e)$ approximation ratio under submodular and weakly submodular objectives, while supporting time-varying objectives and the practical constraint of local-only feedback.
Efficient Adaptive Experimentation with Noncompliance: This paper proposes AMRIV — the first semiparametrically efficient, multiply robust ATE estimator for adaptive experiments with noncompliance — combined with a variance-optimal instrumental variable allocation strategy and sequential inference guarantees.
Efficient Adaptive Federated Optimization: FedAda2 and FedAda2++ make server–client joint adaptive optimization efficient for federated learning by initializing client-side local preconditioners from zero (eliminating server-to-client transmission), optionally with SM3-based memory-efficient compression of local statistics. The method is theoretically shown to achieve the same $O(T^{-1/2})$ convergence rate as full joint adaptivity, while incurring the same communication cost as FedAvg.
Efficient Federated Learning against Byzantine Attacks and Data Heterogeneity via Aggregating Normalized Gradients: This paper proposes Fed-NGA, an algorithm that aggregates client gradients via weighted averaging after $\ell_2$ normalization, achieving Byzantine robustness and resilience to data heterogeneity simultaneously at an extremely low time complexity of $\mathcal{O}(pM)$. Under non-convex loss functions, it is the first to prove zero optimality gap convergence under certain mild conditions.
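The aggregation rule described for Fed-NGA (normalize each client gradient in $\ell_2$, then take a weighted average) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the equal weights, and the treatment of zero gradients are assumptions made here.

```python
import math

def normalize(grad, eps=1e-12):
    """Scale a gradient to unit l2 norm (zero vector if the norm is ~0; an assumption)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm < eps:
        return [0.0 for _ in grad]
    return [g / norm for g in grad]

def aggregate_normalized(grads, weights):
    """Weighted average of l2-normalized client gradients."""
    if len(grads) != len(weights):
        raise ValueError("need one weight per client gradient")
    dim = len(grads[0])
    agg = [0.0] * dim
    for grad, w in zip(grads, weights):
        unit = normalize(grad)
        for j in range(dim):
            agg[j] += w * unit[j]
    return agg

# One ordinary client and one client sending a huge gradient.
honest = [3.0, 4.0]
outlier = [0.0, 1e6]
update = aggregate_normalized([honest, outlier], [0.5, 0.5])
print(update)
```

Because every client contributes a unit vector, the outlier cannot dominate the update through magnitude alone, and the loop touches each coordinate of each gradient once.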
Emergence and Scaling Laws in SGD Learning of Shallow Neural Networks: This paper provides a precise analysis of online SGD learning of additive models (sums of single-index functions) in shallow neural networks, proving that the learning of each teacher neuron exhibits a sharp phase transition (emergence), and that the superposition of many such transition curves across different timescales naturally produces a smooth power-law scaling law.
Escaping Saddle Points without Lipschitz Smoothness: The Power of Nonlinear Preconditioning: This paper proposes a unified sufficient condition connecting the $(L_0,L_1)$-smoothness and anisotropic smoothness frameworks, proves that nonlinear preconditioned gradient methods (including gradient clipping variants) retain saddle-point avoidance under these relaxed conditions, and establishes that a perturbed variant achieves second-order stationary points with polylogarithmic dimension dependence.
Estimation of Stochastic Optimal Transport Maps: This paper proposes a transport error metric $\mathcal{E}_p$ for stochastic OT maps—decomposed into an optimality gap and a feasibility gap—and constructs a computationally efficient rounding estimator achieving a near-optimal convergence rate of $\tilde{O}(n^{-1/(d+2p)})$ under minimal assumptions that require neither the existence nor uniqueness of Brenier maps. The framework is further extended to Hölder-continuous kernels and adversarially corrupted settings, establishing the first general theory for OT map estimation.
Evaluating LLMs for Combinatorial Optimization: One-Phase and Two-Phase Heuristics for 2D Bin-Packing: This paper proposes a systematic evaluation framework combining LLMs with evolutionary algorithms to assess the capability of LLMs in generating and optimizing heuristics for the 2D bin-packing problem. GPT-4o achieves optimal solutions within 2 iterations, reducing the average number of bins from 16 to 15 and improving space utilization from 0.76–0.78 to 0.83.
Exact and Linear Convergence for Federated Learning under Arbitrary Client Participation is Attainable: This paper introduces stochastic matrices and time-varying graphs as modeling tools to unify client participation and local update processes in federated learning into a matrix multiplication formulation. It proposes the FOCUS algorithm (based on a push-pull strategy), which achieves, for the first time, exact convergence and linear convergence rates under arbitrary client participation and data heterogeneity.
Exploring Landscapes for Better Minima along Valleys: This paper proposes an optimizer adapter "E" that incorporates an exponential moving average of gradient differences $\mathbf{a}_k = \text{EMA}(\mathbf{g}_k - \mathbf{g}_{k-1})$ into the gradient update, enabling optimizers to continue exploring "valleys" in the loss landscape for lower and flatter minima after reaching a local minimum. The resulting ALTO optimizer achieves an average improvement of 2.5% in test accuracy under large-batch training.
Extragradient Method for $(L_0, L_1)$-Lipschitz Root-finding Problems: Under the $\alpha$-symmetric $(L_0,L_1)$-Lipschitz condition (a relaxation of the classical $L$-Lipschitz assumption), this paper proposes an adaptive step size strategy $\gamma_k = 1/(c_0 + c_1\|F(x_k)\|^\alpha)$ for the extragradient (EG) method, establishing the first complete convergence guarantees for three classes of root-finding problems: strongly monotone (linear convergence), monotone (sublinear convergence), and weak Minty (local convergence).
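The step size rule $\gamma_k = 1/(c_0 + c_1\|F(x_k)\|^\alpha)$ can be illustrated on a toy problem. The sketch below is an assumption-laden example rather than the paper's algorithm: it reuses the same $\gamma_k$ for both the extrapolation and the update step, and the operator, the constants `c0`, `c1`, `alpha`, and the function names are chosen here for illustration. The operator is linear and strongly monotone, so it is also classically Lipschitz.

```python
import math

def operator(z):
    """Toy strongly monotone operator F(x, y) = (x + y, -x + y); its only root is (0, 0)."""
    x, y = z
    return (x + y, -x + y)

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def adaptive_extragradient(F, z0, c0=4.0, c1=1.0, alpha=1.0, steps=200):
    """Extragradient with gamma_k = 1 / (c0 + c1 * ||F(z_k)||^alpha)."""
    z = tuple(z0)
    for _ in range(steps):
        Fz = F(z)
        gamma = 1.0 / (c0 + c1 * norm(Fz) ** alpha)
        # Extrapolation step from z_k.
        z_half = tuple(zi - gamma * fi for zi, fi in zip(z, Fz))
        # Update step: move from z_k using the operator evaluated at the extrapolated point.
        F_half = F(z_half)
        z = tuple(zi - gamma * fi for zi, fi in zip(z, F_half))
    return z

solution = adaptive_extragradient(operator, (3.0, -2.0))
residual = norm(operator(solution))
print(solution, residual)
```

The step size is small while $\|F(x_k)\|$ is large and approaches $1/c_0$ as the iterates near the root.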
Faster Algorithms for Structured John Ellipsoid Computation: For computing the John ellipsoid of a symmetric convex polytope $P = \{x \in \mathbb{R}^d : -\mathbf{1}_n \leq Ax \leq \mathbf{1}_n\}$, this paper proposes two fast algorithms: a near-input-sparsity algorithm based on sketching with per-iteration cost $\widetilde{O}(\text{nnz}(A) + d^\omega)$, and a treewidth-based algorithm with per-iteration cost $O(n\tau^2)$, both significantly improving upon the prior state-of-the-art cost of $O(nd^2)$.
FedQS: Optimizing Gradient and Model Aggregation for Semi-Asynchronous Federated Learning: This paper proposes FedQS, the first framework to simultaneously optimize both gradient aggregation and model aggregation strategies in semi-asynchronous federated learning (SAFL). By partitioning clients into four categories and adaptively adjusting training strategies, FedQS achieves comprehensive improvements over baselines in accuracy, convergence speed, and stability.
FedRTS: Federated Robust Pruning via Combinatorial Thompson Sampling: This paper reformulates federated dynamic pruning as a Combinatorial Multi-Armed Bandit (CMAB) problem and proposes TSAdj, a Thompson Sampling-based topology adjustment mechanism. By replacing deterministic decisions with probabilistic ones, the method obtains more robust sparse model topologies while significantly reducing communication overhead.
Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers: This paper argues that LLM generalization and hallucination share a common mechanism — out-of-context reasoning (OCR) — and provides theoretical guarantees on a single-layer attention model: the factorized parameterization $(W_O, W_V)$ can perform OCR due to the nuclear norm implicit bias of gradient descent, whereas the merged parameterization $W_{OV}$ cannot due to its Frobenius norm bias. Moreover, OCR is sample-efficient (requiring only $m_{\text{train}}>0$).
Gradient Descent as Loss Landscape Navigation: a Normative Framework for Deriving Learning Rules: This paper proposes treating learning rules as optimal navigation policies in a (partially observable) loss landscape. By solving a continuous-time optimal control problem via variational calculus, it derives gradient descent, momentum, natural gradient, Adam, and continual learning strategies within a unified framework.
Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data: This paper provides the first complete characterization of the implicit bias of Normalized Steepest Descent (NSD) and Normalized Momentum Descent (NMD) on multiclass linearly separable data: these algorithms converge to the maximum-margin solution under the corresponding $p$-norm at a rate of $\mathcal{O}(1/\sqrt{t})$, with Spectral Descent (spectral norm) and Muon as special cases. The analysis is further extended to Adam (max-norm margin).
Improving the Straight-Through Estimator with Zeroth-Order Information: This paper proposes FOGZO (First-Order-Guided Zeroth-Order Gradient Descent), which injects STE gradients as a bias source into zeroth-order gradient estimation. By retaining the computational efficiency of STE while leveraging zeroth-order information to correct occasional erroneous gradient directions, FOGZO achieves 1–22 point improvements in accuracy/perplexity on DeiT, ResNet, and LLaMA with only 2 additional forward passes.
In Search of Adam's Secret Sauce: Through large-scale experiments training 1500+ language models, this paper establishes: (1) Signum closes 96% of the SGD–Adam gap yet remains 25% slower than Adam; (2) setting $\beta_1 = \beta_2$ is a near-optimal simplification of Adam; (3) under $\beta_1 = \beta_2 = \beta$, Adam can be reinterpreted as a signal-to-noise ratio–adaptive Signum that estimates the gradient mean and variance via online Gaussian variational inference.
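The reinterpretation in point (3) can be made concrete with a short per-coordinate calculation. This is a sketch under stated assumptions: the $\epsilon$ term and bias correction are ignored, $m_0 = v_0 = 0$, and the symbol $\sigma_k^2$ is notation introduced here rather than taken from the paper.

$$
m_k = \beta m_{k-1} + (1-\beta)\, g_k, \qquad v_k = \beta v_{k-1} + (1-\beta)\, g_k^2, \qquad \sigma_k^2 := v_k - m_k^2 \ge 0,
$$

$$
\frac{m_k}{\sqrt{v_k}} = \frac{m_k}{\sqrt{m_k^2 + \sigma_k^2}} = \frac{\operatorname{sign}(m_k)}{\sqrt{1 + \sigma_k^2 / m_k^2}} \qquad (m_k \neq 0).
$$

When the noise estimate $\sigma_k^2$ is small relative to $m_k^2$, the update approaches the Signum step $\operatorname{sign}(m_k)$; when it is large, the step shrinks.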
Isotropic Noise in Stochastic and Quantum Convex Optimization: This paper introduces the concept of an Isotropic Stochastic Gradient Oracle (ISGO)—where noise is bounded in every direction with high probability—and designs a stochastic cutting-plane algorithm achieving a query complexity of $\tilde{O}(R^2\sigma_I^2/\epsilon^2 + d)$, improving over SGD by a factor of $d$ in certain parameter regimes. As corollaries, the paper establishes new state-of-the-art complexities under sub-exponential noise and improves the dimension dependence of quantum stochastic convex optimization via a quantum isotropization subroutine.
Kernel Learning with Adversarial Features: Numerical Efficiency and Adaptive Regularization: This paper proposes a new paradigm that relocates adversarial perturbations from the input space to the feature space within reproducing kernel Hilbert spaces (RKHS), enabling exact closed-form solutions to the inner maximization problem. The outer minimization is solved efficiently via iterative reweighted kernel ridge regression, and the resulting adaptive regularization matches cross-validation performance without any hyperparameter tuning.
Large Language Bayes: This work mathematically "glues" an LLM and a probabilistic programming language (PPL/Stan) into a joint distribution $p(z,x,m|t) = p(m|t)_{\text{LLM}} \cdot p(z,x|m)_{\text{PPL}}$. Given only an informal problem description and data, the system automatically samples candidate formal models from the LLM, performs Bayesian inference within each model, and produces a marginal-likelihood-weighted model average — requiring no user-written probabilistic model.
Large Stepsizes Accelerate Gradient Descent for Regularized Logistic Regression: This paper proves that applying GD with large stepsizes (entering the Edge of Stability regime) to $\ell_2$-regularized logistic regression on linearly separable data accelerates the step complexity from the classical $\widetilde{O}(\kappa)$ to $\widetilde{O}(\sqrt{\kappa})$, matching the acceleration rate of Nesterov momentum in the small-regularization regime.
Layer-wise Update Aggregation with Recycling for Communication-Efficient Federated Learning: This paper proposes FedLUAR, which uses a gradient-to-weight ratio metric to identify low-priority layers and recycles their updates from the previous round (rather than discarding them), achieving accuracy nearly identical to FedAvg with only 17% communication overhead.
Learning at the Speed of Physics: Equilibrium Propagation on Oscillator Ising Machines: This work presents the first complete mapping of Equilibrium Propagation (EP) onto Oscillator Ising Machine (OIM) hardware, leveraging GHz-scale physical dynamics to enable backpropagation-free local learning. The approach achieves 97.2%/88.0% accuracy on MNIST/Fashion-MNIST and demonstrates robustness under parameter quantization and noise.
Learning from Interval Targets: This paper studies regression under interval-only supervision (lower and upper bounds), establishes non-asymptotic generalization bounds based on hypothesis class smoothness without requiring a small ambiguity degree assumption, and proposes a minmax learning framework that leverages smoothness constraints to limit worst-case labels, achieving significant improvements over unconstrained methods across 18 real-world datasets.
Learning Orthogonal Multi-Index Models: A Fine-Grained Information Exponent Analysis: This paper proves that orthogonal multi-index models $f_*(\mathbf{x}) = \sum_{k=1}^P \phi(\mathbf{v}_k^* \cdot \mathbf{x})$ can be learned via a two-phase online SGD with sample complexity $\tilde{O}(dP^{L-1})$ (where $L$ is the lowest higher-order Hermite degree of the link function), significantly improving upon $\tilde{O}(Pd^{L-1})$ obtained by using only the lowest-order information. The key insight is to first recover the subspace using second-order terms and then recover the directions using $L$-th order terms, jointly exploiting Hermite components of different orders.
Learning Parameterized Skills from Demonstrations: This paper proposes DEPS, an end-to-end algorithm for discovering parameterized skills from expert demonstrations. Through a three-level hierarchical policy (discrete skill selection → continuous parameter selection → low-level actions) and an information bottleneck design, DEPS learns interpretable and generalizable skill abstractions, achieving significant improvements over baselines on LIBERO and MetaWorld.
Learning Provably Improves the Convergence of Gradient Descent: This paper presents the first rigorous proof of training convergence for an unrolling-based Learn to Optimize (L2O) framework (Math-L2O). By leveraging NTK theory, it establishes a linear convergence rate and proposes a deterministic initialization strategy that provably ensures L2O improves upon the convergence performance of gradient descent. Experiments demonstrate over 50% improvement in optimality compared to standard GD.
Learning Quadratic Neural Networks in High Dimensions: SGD Dynamics and Scaling Laws: This paper provides a precise analysis of gradient-based training of two-layer neural networks with quadratic activations in high dimensions. Under the setting where data is generated by $f_*(x) \propto \sum_{j=1}^r \lambda_j \sigma(\langle \theta_j, x \rangle)$, with extensive width $r \asymp d^\beta$ and power-law decaying coefficients $\lambda_j \asymp j^{-\alpha}$, the paper derives scaling laws for the prediction risk as a function of optimization time, sample size, and model width.
Learning Reconfigurable Representations for Multimodal Federated Learning with Missing Data: This paper proposes PEPSY, a framework that learns client-side embedding controls to encode data-missing patterns, reconfiguring globally aggregated representations into complete-data features adapted to each client's local context, addressing both modality-missing and feature-missing scenarios in multimodal federated learning.
Learning Single-Index Models via Harmonic Decomposition: This paper proposes spherical harmonics as a natural basis for single-index models (SIMs) in place of Hermite polynomials, leveraging rotational symmetry to characterize sample and computational complexity for learning SIMs under arbitrary spherically symmetric input distributions. Two families of optimal estimators are constructed (tensor unfolding + online SGD), and a sample–runtime tradeoff is revealed that does not arise in the Gaussian setting.
Learning Sparse Approximate Inverse Preconditioners for Conjugate Gradient Solvers on GPUs: This paper proposes a graph neural network (GNN)-based method for learning sparse approximate inverse (SPAI) preconditioners, exploiting the natural compatibility between SPAI's locality and GNN message passing, and introduces a scale-invariant loss function (SAI loss) that achieves 40%–53% reduction in solve time (68%–113% speedup) on GPUs.
Learning Theory for Kernel Bilevel Optimization: This work establishes the first finite-sample generalization bounds for kernel bilevel optimization (KBO), proving that plug-in estimation errors for both the objective value and its gradient converge uniformly at the parametric rate $\mathcal{O}(1/\sqrt{m}+1/\sqrt{n})$, and applies this theory to analyze the statistical accuracy of bilevel gradient descent.
Learning to Insert for Constructive Neural Vehicle Routing Solver: This paper proposes L2C-Insert, the first learning-based insertion construction paradigm for neural combinatorial optimization. By allowing nodes to be inserted at any feasible position within a partial solution—rather than appended only to its end—the method significantly improves construction quality and flexibility for TSP/CVRP.
Least Squares Variational Inference: This paper proposes LSVI (Least Squares Variational Inference), a gradient-free variational inference method based on ordinary least squares regression. Within the exponential family, LSVI iteratively solves for the optimal variational approximation by performing OLS regression on a tempered log-target, admitting efficient $O(d^3)$ (full-covariance) or $O(d)$ (mean-field) implementations for the Gaussian family.
MAR-FL: A Communication Efficient Peer-to-Peer Federated Learning System: This paper proposes MAR-FL, a system that reduces the communication complexity of P2P federated learning from $O(N^2)$ to $O(N \log N)$ via a Moshpit All-Reduce mechanism and dynamic group aggregation, while maintaining robustness against network jitter.
MDNS: Masked Diffusion Neural Sampler via Stochastic Optimal Control: This paper proposes the Masked Diffusion Neural Sampler (MDNS), a framework grounded in stochastic optimal control theory for continuous-time Markov chains (CTMCs). By aligning path measures, MDNS trains a discrete neural sampler capable of accurately sampling from Ising/Potts models with state spaces as large as $10^{122}$, substantially outperforming existing learning-based baselines.
MeCeFO: Enhancing LLM Training Robustness via Fault-Tolerant Optimization: MeCeFO proposes a fault-tolerant optimization algorithm for LLM training that minimizes overhead during node failures through three techniques—skip-connections, selective activation recomputation, and low-rank gradient approximation—achieving only a 4.18% throughput drop under high-frequency failure conditions.
Memory-Augmented Potential Field Theory: A Framework for Adaptive Control in Non-Convex Domains: This paper proposes Memory-Augmented Potential Field Theory (MAPFT), which maintains a dynamic memory module within stochastic optimal control to detect and encode topological features of the state space (local minima, low-gradient regions, etc.), and adaptively reshapes the value function landscape to enable control in non-convex environments. On tasks such as Humanoid-v4, the method achieves a 27% improvement in cumulative reward over the best RL baseline (SAC), and raises the local optima escape rate from ~30% to ~72%.
MESS+: Dynamically Learned Inference-Time LLM Routing in Model Zoos with Service Level Guarantees: MESS+ is the first framework to formalize LLM request routing as a constrained stochastic optimization problem with SLA guarantees. By combining an online-learned request satisfaction predictor with a virtual queue mechanism, it dynamically selects models per request. Across 3 reasoning and 5 question-answering benchmarks, MESS+ achieves an average 2× cost reduction while satisfying SLA constraints, with theoretical guarantees on both cost optimality and constraint satisfaction.
MOBO-OSD: Batch Multi-Objective Bayesian Optimization via Orthogonal Search Directions: This paper proposes MOBO-OSD, an algorithm that generates diverse Pareto-optimal solutions by defining orthogonal search directions on an approximated convex hull of individual minima (CHIM), combined with Pareto front estimation and a batch selection strategy, consistently outperforming state-of-the-art multi-objective Bayesian optimization methods on both synthetic and real-world benchmarks.
Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent: This work rigorously proves, from the perspective of gradient descent training dynamics, that a single-layer multi-head Transformer can learn both forward and backward reasoning on a tree path-finding task via Chain-of-Thought, and reveals that distinct attention heads spontaneously specialize to collaboratively solve multi-stage subtasks.
Multiplayer Federated Learning: Reaching Equilibrium with Less Communication: This paper proposes the Multiplayer Federated Learning (MpFL) framework, which models FL clients as rational players in a game-theoretic setting, and introduces the PEARL-SGD algorithm that reduces communication overhead via local updates while converging to a Nash equilibrium.
Natural Gradient Descent for Improving Variational Inference Based Classification of Radio Galaxies: This work replaces standard SGD with the natural gradient descent optimizer iVON for optimizing BNN parameters under variational inference, achieving better uncertainty calibration in radio galaxy classification while maintaining predictive performance comparable to HMC and BBB-VI.
Natural Gradient VI: Guarantees for Non-Conjugate Models: Under mean-field parameterization, this paper establishes three key theoretical results for natural gradient variational inference (NGVI) in non-conjugate models: a relative smoothness condition on the variational loss, a global convergence-to-stationary-point guarantee for a modified NGVI with non-Euclidean projections, and, under additional structural assumptions, hidden convexity and fast global convergence guarantees.
Near-Exponential Savings for Mean Estimation with Active Learning: This paper proposes the PartiBandits algorithm, which combines disagreement-based active learning with UCB-style stratified sampling to achieve near-exponential label savings for mean estimation when auxiliary information $X$ is predictive of the target variable $Y$.
Neural Thermodynamics: Entropic Forces in Deep and Universal Representation Learning: This paper establishes a "neural thermodynamics" framework, proving that emergent entropic forces arising from stochasticity and discrete-time updates in SGD training systematically break continuous parameter symmetries while preserving discrete ones. This leads to a gradient balance phenomenon analogous to thermodynamic equipartition, thereby (a) providing the first theoretical proof of the Platonic Representation Hypothesis (that different models learn similar representations), and (b) reconciling the seemingly contradictory observations of sharpness-seeking and flatness-seeking behavior in deep learning optimization.
NeuSymEA: Neuro-symbolic Entity Alignment via Variational Inference: This paper proposes NeuSymEA, a neuro-symbolic reasoning framework based on a variational EM algorithm that unifies symbolic rule reasoning and neural network embeddings within a Markov Random Field for entity alignment, achieving significant performance gains and low-resource robustness on DBP15K.
Non-Stationary Bandit Convex Optimization: A Comprehensive Study: This paper systematically studies bandit convex optimization (BCO) in non-stationary environments, proposes two algorithms (TEWA-SE and cExO), and establishes unified regret upper and lower bounds under three non-stationarity measures (number of switches $S$, total variation $\Delta$, and path length $P$), achieving minimax optimality in several settings.
On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts: This paper presents the first minimax parameter estimation analysis for contaminated Mixture of Experts (MoE) models with softmax gating. It introduces the concept of "distinguishability" to characterize the relationship between a pretrained model and a prompt, proving that MLE achieves the parametric rate $\tilde{O}(n^{-1/2})$ when distinguishability holds, while the rate degrades significantly otherwise.
Online Two-Stage Submodular Maximization: This paper introduces the Online Two-Stage Submodular Maximization (O2SSM) problem and proposes the RAOCO algorithm for Weighted Threshold Potential (WTP) functions. By combining fractional relaxation with randomized pipage rounding, RAOCO achieves sublinear $(1-1/e)^2$-regret in polynomial time, while also improving the approximation ratio for the offline problem.
Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification: This paper establishes a generalization rate of $\widetilde{O}(L^4(1+\gamma L^2)/(n\gamma^2))$ for gradient descent on deep ReLU networks. It is the first result to achieve both (1) the optimal $1/n$ dependence on sample size $n$ and (2) only polynomial dependence on depth $L$.
Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions: This paper analyzes the ICL capability of Transformers for learning Markovian dynamical functions from an optimization-theoretic perspective. It derives the global optimal solution (in closed form) for single-layer linear self-attention, proves that recovering Transformer parameters from an extended parameter space is NP-hard, and reveals that multi-layer LSA is equivalent to preconditioned multi-objective optimization.
Optimistic Online-to-Batch Conversions for Accelerated Convergence and Universality: This paper proposes an optimistic online-to-batch (O2B) conversion framework that relocates optimism from the online algorithm to the conversion mechanism itself, enabling simple online gradient descent (OGD) to achieve an $O(T^{-2})$ accelerated convergence rate. It is the first to achieve optimal convergence for strongly convex smooth objectives via O2B conversion, while simultaneously attaining universality with respect to smoothness.
Oracle-Efficient Combinatorial Semi-Bandits: Two oracle-efficient frameworks (adaptive and scheduled) are proposed for combinatorial semi-bandits, reducing oracle calls from $\Theta(T)$ to $O(\log\log T)$ while maintaining near-optimal regret bounds.
OrthoGrad Improves Neural Calibration: This paper presents the first systematic study of OrthoGrad (⊥Grad)—a geometrically constrained optimization method that projects gradients layer-wise onto directions orthogonal to the weights—for neural network calibration. Experiments on CIFAR-10 in low-data regimes demonstrate that OrthoGrad significantly improves calibration metrics (entropy, loss, confidence) without degrading accuracy, and the paper establishes convergence guarantees for a simplified variant under standard assumptions.
Personalized Subgraph Federated Learning with Differentiable Auxiliary Projections: This paper proposes FedAux, a framework that introduces differentiable Auxiliary Projection Vectors (APVs) to map node embeddings into a one-dimensional space and perform soft-ranking aggregation via Gaussian kernels. The APV simultaneously serves as a compact, privacy-preserving summary of the local subgraph for server-side similarity computation and participates in joint client-side optimization, enabling personalized subgraph federated learning.
Perturbation Bounds for Low-Rank Inverse Approximations under Noise: This paper provides the first non-asymptotic spectral norm perturbation bound for low-rank inverse approximations $\|(\tilde{A}^{-1})_p - A_p^{-1}\|$ under additive noise. Using contour integration techniques, it derives sharp bounds depending on the eigenvalue gap, spectral decay, and noise alignment, improving upon classical full-inverse bounds by up to a factor of $\sqrt{n}$.
Preference Learning with Response Time: Robust Losses and Guarantees: This paper incorporates response time information from user decision-making into the preference learning framework, reducing the estimation error of reward model learning from exponential to polynomial order via a Neyman orthogonal loss function.
Probing Neural Combinatorial Optimization Models: This work is the first to systematically introduce probing methodology into the study of neural combinatorial optimization (NCO) models. It proposes a CS-Probing framework to analyze the decision-relevant knowledge, inductive biases, and generalization mechanisms encoded in model representations, and identifies critical embedding dimensions that can be leveraged to improve model generalization.
Problem-Parameter-Free Decentralized Bilevel Optimization: This paper proposes AdaSDBO, a fully parameter-free decentralized bilevel optimization algorithm that requires no prior knowledge of problem parameters. By employing adaptive step sizes based on cumulative gradient norms, it achieves a convergence rate of $\tilde{O}(1/\sqrt{T})$.
PROFIT: A Specialized Optimizer for Deep Fine Tuning: PROFIT frames fine-tuning as a multi-task learning problem across the time dimension, and achieves forgetting-resistant fine-tuning without additional data or parameters by orthogonally projecting new-task gradients onto the direction of a "regression equilibrium point."
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry: This paper reveals a fundamental duality between sparse autoencoder (SAE) architectures and the concept structures they are capable of discovering — each SAE implicitly assumes a particular organization of concepts, and when this assumption is mismatched, concepts are systematically missed. Based on this analysis, the authors propose SpaDE, a novel SAE that accounts for nonlinear separability and dimensional heterogeneity.
Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner: By decomposing Shampoo's preconditioner into eigenvalue and eigenbasis components, this work reveals that learning rate grafting essentially compensates for eigenvalue staleness and scaling bias, and proposes eigenvalue correction and adaptive eigenbasis update frequency as principled replacements for these heuristics.
Quantitative Convergence of Trained Single Layer Neural Networks to Gaussian Processes: This paper establishes explicit quantitative upper bounds on the convergence of gradient-descent-trained shallow neural networks to Gaussian processes at any training time $t \geq 0$, proving that the squared 2-Wasserstein distance decays polynomially at rate $O(\log n_1 / n_1)$.
Rao-Blackwellised Reparameterisation Gradients: This paper proposes the R2-G2 estimator as a Rao-Blackwellised variant of reparameterisation gradients, proves that local reparameterisation in Bayesian MLPs is a special case thereof, and extends the low-variance gradient advantage to a broad class of probabilistic models.
Rethinking Neural Combinatorial Optimization for Vehicle Routing Problems with Different Constraint Tightness Degrees: This paper reveals that existing NCO methods severely overfit to fixed constraint tightness (e.g., fixed vehicle capacity $C=50$ in CVRP), and proposes a Variable Constraint Tightness (VCT) training scheme along with a Multiple Expert Module (MEM), enabling models to effectively handle the full spectrum of constraints from extremely tight to extremely loose.
Revisiting Orbital Minimization Method for Neural Operator Decomposition: This paper revisits the classical Orbital Minimization Method (OMM) originating from computational chemistry, provides a concise linear-algebraic consistency proof, reveals its deep connections to Sanger's rule and streaming PCA, and generalizes it into a unified framework for training neural networks to perform spectral decomposition of positive semidefinite operators.
Robust Estimation Under Heterogeneous Corruption Rates: This paper studies robust estimation under heterogeneous corruption rates—where each sample is corrupted with a distinct known probability—and establishes tight minimax rates for mean estimation and linear regression under both bounded and Gaussian distributions. A key finding is that the optimal estimator can simply discard samples whose corruption rates exceed a certain threshold.
Second-Order Optimization Under Heavy-Tailed Noise: Hessian Clipping and Sample Complexity: This paper provides the first systematic theoretical treatment of second-order stochastic optimization under heavy-tailed noise. It establishes tight minimax sample complexity lower bounds, proposes a normalized SGD algorithm with gradient and Hessian clipping (Clip NSGDHess), and proves that the proposed algorithm nearly achieves the information-theoretic limit.
Set Smoothness Unlocks Clarke Hyper-stationarity in Bilevel Optimization: This paper introduces set smoothness, a novel structural property of set-valued mappings, proves that it holds naturally in the nonconvex-PŁ bilevel setting, and uses it to reveal hidden weak convexity/concavity structure in the hyper-objective. This yields the first computability guarantee for Clarke stationary points of nonsmooth hyper-objectives.
Sharper Convergence Rates for Nonconvex Optimisation via Reduction Mappings: This paper proposes the Reduction Mapping framework, which exploits the manifold structure of the optimal solution set (arising from over-parameterization or symmetry) to reparameterize the optimization problem, and proves that this improves curvature properties and theoretically accelerates the convergence of gradient-based methods.
Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful: This paper systematically investigates the behavior of small batch sizes (down to batch size = 1) in language model pre-training and fine-tuning. It proposes a scaling rule for Adam $\beta_2$ based on fixing the "token half-life," demonstrates that small-batch training is stable, shows that vanilla SGD becomes competitive with adaptive optimizers under small batches, and recommends avoiding gradient accumulation.
Stable Coresets via Posterior Sampling: Aligning Induced and Full Loss Landscapes: This paper proposes a coreset selection framework based on posterior sampling. By sampling weight perturbations on BatchNorm layers to smooth the loss landscape, it guarantees alignment between the coreset and full-dataset loss landscapes (including approximations of the Hessian and Newton step), significantly outperforming existing methods under high label noise.
Stochastic Momentum Methods for Non-smooth Non-Convex Finite-Sum Coupled Compositional Optimization: For non-smooth non-convex finite-sum coupled compositional optimization (FCCO), this paper proposes two stochastic momentum methods — SONEX (single-loop) and ALEXR2 (double-loop) — that improve the iteration complexity from $O(1/\epsilon^6)$ to $O(1/\epsilon^5)$ via outer Moreau envelope smoothing and nested smoothing techniques, and achieve the same optimal complexity for non-convex inequality-constrained optimization.
Streaming Federated Learning with Markovian Data: This work provides the first rigorous analysis of streaming federated learning with Markovian data under non-convex objectives, establishing that Minibatch SGD, Local SGD, and Local SGD-M all achieve sample complexity inversely proportional to the number of clients (linear speedup), and that Local SGD-M matches the communication complexity of Minibatch SGD without requiring heterogeneity assumptions.
The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels: This paper provides the first theoretical proof that the implicit bias of Structured State Space Models (SSMs) can be poisoned by clean-label training samples — specifically, there exist specially constructed training examples whose labels are correctly assigned by a teacher model, yet their inclusion fundamentally distorts the implicit bias of SSMs, causing complete generalization failure.
The Rich and the Simple: On the Implicit Bias of Adam and SGD: This paper provides theoretical and empirical evidence that neural networks trained with SGD tend to learn simple linear features (simplicity bias), whereas Adam produces richer nonlinear features, yielding predictors closer to the Bayes-optimal classifier and better generalization under distribution shift.
Training-Free Bayesianization for Low-Rank Adapters of Large Language Models: This paper proposes TFB (Training-Free Bayesianization), which converts a pre-trained LoRA adapter into its Bayesian counterpart without any retraining by searching for the maximum admissible variance within a family of low-rank isotropic Gaussian distributions. The procedure is theoretically shown to be equivalent to generalized variational inference.
Training Robust Graph Neural Networks by Modeling Noise Dependencies: This paper proposes Dependency-Aware Graph Noise (DANG) and the DA-GNN framework, which model a causal dependency chain from node feature noise → graph structure noise → label noise, and employ variational inference to derive an ELBO objective for training GNNs robust to multi-source correlated noise.
Understanding Adam Requires Better Rotation Dependent Assumptions: Through systematic empirical investigation, this paper demonstrates that the Adam optimizer exhibits strong dependence on the choice of coordinate basis in parameter space, showing that existing rotation-invariant theoretical assumptions are insufficient to explain Adam's superiority. The orthogonality of per-layer updates is identified as a reliable indicator for predicting Adam's performance under different bases.
Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks: This work presents the first theoretical analysis of the generalization behavior of mini-batch Adam. It proves that large-batch Adam/AdamW converges to solutions with high test error even with weight decay, whereas small-batch variants achieve near-zero test error through a combination of implicit regularization from stochastic gradients and explicit regularization from weight decay. Moreover, the effective weight decay upper bound for Adam is strictly smaller than that for AdamW.
Unveiling m-Sharpness Through the Structure of Stochastic Gradient Noise: This paper reveals the theoretical mechanism underlying m-sharpness in SAM through an extended stochastic differential equation (SDE) framework — smaller micro-batch size $m$ induces stronger implicit regularization via the covariance of stochastic gradient noise (SGN) — and proposes a parallelizable Reweighted SAM (RW-SAM) method based on this insight.
Unveiling the Power of Multiple Gossip Steps: A Stability-Based Generalization Analysis in Decentralized Training: This paper presents the first stability-based generalization analysis of Multiple Gossip Steps (MGS) in Decentralized SGD (DSGD), proving that MGS reduces optimization error at an exponential rate to tighten generalization bounds, while also establishing that even as the number of Gossip steps approaches infinity, DSGD cannot fully close the generalization gap with centralized training.
VERA: Variational Inference Framework for Jailbreaking Large Language Models: This paper formalizes black-box LLM jailbreaking as a variational inference problem, training a small attacker LLM to approximate the posterior distribution of adversarial prompts for a target LLM. Once trained, the attacker can efficiently generate diverse jailbreak prompts without relying on human-crafted templates.
Verbalized Algorithms: Zero-shot Classical Algorithmic Reasoning for Correctness and Runtime Guarantees: This paper proposes the Verbalized Algorithms (VAs) framework, which preserves the control flow of classical algorithms while replacing their atomic operations (e.g., binary comparisons) with LLM queries, thereby inheriting the correctness and complexity guarantees of classical algorithms for natural language reasoning tasks. The framework is validated across four case studies: sorting, maximum finding, clustering, and submodular maximization.
VIKING: Deep Variational Inference with Stochastic Projections: VIKING proposes a variational approximate posterior family based on the kernel- and image-space decomposition of the Fisher-Rao metric, and achieves scalable full-covariance Bayesian training via a stochastic alternating projections algorithm, surpassing existing Bayesian deep learning methods on multiple benchmarks.
Wasserstein Transfer Learning: This paper proposes WaTL, the first transfer learning framework for distributional outputs in Wasserstein space. It adopts a three-step procedure — weighted auxiliary estimation, bias correction, and projection — combined with adaptive source selection, to transfer knowledge from source domains and improve distributional regression in the target domain.
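
The "token half-life" idea mentioned in the small-batch training note above can be illustrated with a short calculation. This is a minimal sketch, not the paper's implementation: it assumes the half-life is the number of tokens after which a gradient's weight in Adam's second-moment moving average has decayed by half, and the function names `token_half_life` and `rescale_beta2` are hypothetical. Under that assumption, holding the half-life fixed while changing the batch size gives $\beta_2' = \beta_2^{B'/B}$.

```python
import math


def token_half_life(beta2: float, batch_size: int, seq_len: int) -> float:
    """Tokens processed before a gradient's weight in the EMA halves.

    Assumed definition: solve beta2**steps == 0.5 for steps, then
    multiply by the number of tokens consumed per optimizer step.
    """
    steps = math.log(0.5) / math.log(beta2)
    return steps * batch_size * seq_len


def rescale_beta2(beta2: float, old_batch_size: int, new_batch_size: int) -> float:
    """beta2 for the new batch size that keeps the token half-life fixed.

    Follows from steps_new * new_batch_size == steps_old * old_batch_size.
    """
    return beta2 ** (new_batch_size / old_batch_size)


# Illustrative values only; they are not taken from the paper.
base_beta2 = 0.95
small_beta2 = rescale_beta2(base_beta2, old_batch_size=512, new_batch_size=1)

print(f"beta2 at batch size 1: {small_beta2:.6f}")
print(f"half-life (tokens) at batch 512: {token_half_life(base_beta2, 512, 1024):.0f}")
print(f"half-life (tokens) at batch 1:   {token_half_life(small_beta2, 1, 1024):.0f}")
```

The two printed half-lives agree by construction, which is the point of the sketch: a smaller batch needs a $\beta_2$ much closer to 1 for the moving average to cover the same number of tokens.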

🔬 Interpretability¶

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders: This paper identifies and systematically studies the phenomenon of "feature absorption" in SAEs: apparently monosemantic SAE latents fail to activate on certain tokens because their feature directions are "absorbed" by more specific sub-latents. This is shown to be an inevitable consequence of hierarchical features combined with sparsity loss, posing a fundamental challenge to using SAEs for reliable LLM interpretation.
A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis: This paper proposes a fully zero-shot, training-free video anomaly analysis framework that employs Intra-Task Reasoning (confidence-gated self-refinement) and Inter-Task Chaining (cascaded prompt passing from temporal detection to spatial localization to semantic understanding), improving AUC by 4–6% over prior zero-shot methods across 4 benchmarks.
AdaptGrad: Adaptive Sampling to Reduce Noise: AdaptGrad analyzes the theoretical origin of noise in SmoothGrad—out-of-boundary (OOB) sampling behavior—and proposes adaptively adjusting the Gaussian sampling variance for each input dimension to bound the additional noise. The method nearly eliminates gradient noise while revealing richer fine-grained features, requires minimal implementation effort, and is composable with arbitrary gradient-based explanation methods.
Additive Models Explained: A Computational Complexity Approach: This paper presents a systematic computational complexity analysis of multiple explanation types for Generalized Additive Models (GAMs), covering 54 combinations of "component model × input domain × explanation method." It reveals that the explanation complexity of GAMs is highly sensitive to the type of input domain — a phenomenon never observed in other ML models such as decision trees or neural networks — thereby challenging the intuitive assumption that "additive implies interpretable."
AgentiQL: An Agent-Inspired Multi-Expert Framework for Text-to-SQL Generation: This paper proposes AgentiQL, a multi-expert agent framework for Text-to-SQL: a reasoning agent decomposes questions into sub-problems, a coding agent generates sub-queries, a refinement step corrects column selection, and an adaptive router intelligently routes between a baseline parser and the modular pipeline. Using a 14B open-source model, AgentiQL achieves 86.07% EX on Spider, approaching GPT-4 SOTA (89.65%).
An Analysis of Concept Bottleneck Models: Measuring, Understanding, and Mitigating the Impact of Noisy Annotations: This paper presents the first systematic study on the impact of annotation noise on Concept Bottleneck Models (CBMs). It identifies approximately 23% of concepts as "susceptible concepts" that drive the majority of performance degradation, and proposes a two-stage mitigation strategy combining SAM at training time and uncertainty-guided intervention at inference time to restore model robustness.
Are Greedy Task Orderings Better Than Random in Continual Linear Regression?: This paper systematically analyzes the convergence differences between greedy task orderings (maximizing dissimilarity between consecutive tasks) and random orderings in continual linear regression. It reveals that greedy orderings are competitive with random orderings in the full-rank setting, but single-pass greedy ordering can fail catastrophically in the general-rank setting, whereas greedy ordering with repetition achieves a convergence rate of $\mathcal{O}(1/\sqrt[3]{k})$.
ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation: ARECHO frames speech multi-metric evaluation as a chain-based autoregressive token prediction task. It designs a unified speech information tokenization pipeline to handle 87 heterogeneous metrics (numerical/categorical/bounded/unbounded), explicitly captures inter-metric dependencies (e.g., intelligibility–naturalness correlation) via dynamic classification chains, and employs two-step confidence-guided decoding to reduce error propagation. ARECHO comprehensively outperforms the UniVERSA baseline across enhancement, synthesis, and noisy speech evaluation (Avg Test MSE 23.26 vs. 96.99, −76%).
ARC-JSD: Attributing Response to Context via Jensen-Shannon Divergence Driven Mechanistic Study: ARC-JSD proposes a RAG context attribution method based on Jensen-Shannon Divergence — by comparing the JSD between model output distributions with and without specific context sentences, it localizes the context that a response depends on without fine-tuning or gradient computation. The method achieves 3× faster computation than baselines, improves Top-1 attribution accuracy by 10.7% on average, and reveals via Logit Lens that attribution-relevant attention heads are concentrated in higher layers.
Auditing Meta-Cognitive Hallucinations in Reasoning Large Language Models: This paper systematically audits the generation and propagation mechanisms of hallucinations in reasoning large language models (RLLMs), finding that reflection in long CoT amplifies hallucinations through metacognitive bias rather than correcting them. Even targeted interventions at the hallucination source fail to alter final outputs (chain disloyalty), exposing critical shortcomings of existing hallucination detection methods in multi-step reasoning scenarios.
Base Models Know How to Reason, Thinking Models Learn When: Through unsupervised SAE clustering, this work discovers a taxonomy of reasoning mechanisms in thinking models, then activates the corresponding latent capabilities in base models via steering vectors. The resulting hybrid model recovers up to 91% of the performance gap between thinking and base models—without any weight updates—demonstrating that base models already possess reasoning capabilities, and that thinking models merely learn when to deploy them.
Better Estimation of the Kullback-Leibler Divergence Between Language Models: This paper proposes a Rao-Blackwellized Monte Carlo estimator for KL divergence that computes the exact KL over the next-token distribution at each position (rather than relying solely on the sampled token). The estimator is theoretically proven to be unbiased with variance no greater than that of the standard MC estimator, incurs zero additional computational overhead, and yields more stable training in an RLHF sentiment-control task, with models appearing on the Pareto frontier 78% of the time.
Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning: This paper proposes SPARKLE, a three-axis analytical framework (plan following, knowledge integration, subproblem decomposition) for fine-grained dissection of how RL shapes LLM reasoning behavior. The analysis reveals that RL primarily enhances knowledge integration and planning flexibility rather than plan execution. The paper further introduces SparkleRL-PSS, a multi-stage RL training pipeline that effectively exploits hard problem data via partial step scaffolding.
Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits: This paper proposes a direction-level interpretability framework based on singular vectors. By applying a unified SVD to augmented matrices of attention heads and MLPs, combined with a learnable diagonal mask optimized via a KL + L₁ objective, the framework reveals orthogonal low-rank subfunctions superposed within a single component. On the IOI task, retaining only ~9% of directions suffices to reproduce model behavior with KLD=0.21.
Beyond Token Probes: Hallucination Detection via Activation Tensors with ACT-ViT: This paper organizes all hidden-layer activations of an LLM into an "activation tensor" (layers × tokens × hidden dimension), treats it analogously to an image, and processes it with a ViT-based architecture (ACT-ViT) that supports joint training across multiple LLMs. The method consistently outperforms conventional probing approaches across 15 LLM–dataset combinations and demonstrates strong zero-shot/few-shot transfer to unseen datasets and unseen LLMs.
Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models: Using continuous sparsification, the authors identify bigram subnetworks containing only ~10M parameters within Transformer language models. These subnetworks are concentrated in the first MLP layer, suffice to reproduce bigram predictions ($r>0.95$), and cause dramatic performance degradation when ablated — demonstrating that they constitute minimal next-token prediction circuits that are both necessary and sufficient in language models.
Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers: This paper proposes Causal Head Gating (CHG), which learns a differentiable gating scalar for each attention head in a Transformer and applies positive/negative regularization to classify heads into three causal roles—facilitating, interfering, and irrelevant—without requiring manual labels or prompt templates. The framework discovers causal sub-circuits at scale and extends to Contrastive CHG for disentangling independent circuits underlying in-context learning (ICL) and instruction following.
CBMAS: Cognitive Behavioral Modeling via Activation Steering: CBMAS proposes a framework that repurposes activation steering as a continuous diagnostic tool. By conducting dense α-sweeps and decoupling injection layers from readout layers, the framework elevates cognitive bias analysis from a binary "biased / unbiased" judgment to a continuous trajectory analysis capable of tracking flip points, propagation paths, and attenuation patterns. Experiments on GPT-2 Small reveal that appeasement behavior is strongly encoded in shallow layers but decays rapidly toward deeper layers.
CHiQPM: Calibrated Hierarchical Interpretable Image Classification: CHiQPM proposes a calibrated hierarchical interpretable image classification method that selects and assigns features to classes via quadratic programming, constructs hierarchical explanation paths, and incorporates interpretable conformal set prediction, retaining 99% of black-box model accuracy while providing both global and local interpretability.
Cognitive Mirrors: Exploring the Diverse Functional Roles of Attention Heads in LLM Reasoning: This paper proposes the CogQA benchmark dataset and a multi-class probing framework to systematically analyze cognitive functional specialization of attention heads in LLMs. The study reveals that cognitive heads exhibit sparsity, universality, and hierarchical functional organization; ablating cognitive heads significantly degrades reasoning performance, while amplifying them improves accuracy.
Conditional Distribution Compression via the Kernel Conditional Mean Embedding: This work presents the first compression algorithm targeting conditional distributions (rather than joint distributions), introducing a novel metric AMCMD based on the kernel conditional mean embedding (KCME) and a linear-time algorithm ACKIP for constructing compressed datasets that preserve the statistical properties of conditional distributions.
Curvature Tuning: Provable Training-free Model Steering From a Single Parameter: This paper proposes Curvature Tuning (CT), which provably modulates the curvature of a model's decision boundary by injecting a single hyperparameter $\beta$ into the activation function. CT improves generalization and robustness without modifying weights, and when used as a fine-tuning method it requires far fewer parameters than rank-1 LoRA.
Dataset Distillation for Pre-Trained Self-Supervised Vision Models: This paper proposes Linear Gradient Matching, a dataset distillation method for pre-trained self-supervised vision models. A single synthetic image per class suffices to train a linear classifier approaching full-dataset performance, and the distilled images transfer across model architectures.
Deep Modularity Networks with Diversity-Preserving Regularization: This work augments Deep Modularity Networks (DMoN) with three diversity-preserving regularization terms—distance-based, variance-based, and entropy-based—to explicitly promote inter-cluster separation and assignment diversity in feature space, achieving significant clustering quality improvements on feature-rich graph datasets.
Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences: This paper proposes the Deep Value Benchmark (DVB), which employs a confound-then-deconfound experimental design to measure whether LLMs learn deep human values or merely memorize shallow preference patterns. Results show that the Deep Value Generalization Rate (DVGR) of all evaluated models averages only 0.30, far below chance level.
Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework: This paper proposes HAP, a Hybrid Attribution and Pruning framework that first applies fast Edge Attribution Patching (EAP) to filter high-potential subgraphs, then runs precise Edge Pruning (EP) on the reduced search space. On the IOI task with GPT-2 Small, HAP achieves a 46% speedup over pure EP while maintaining comparable circuit faithfulness, and successfully recovers S-inhibition heads that EAP alone fails to identify.
Distributional Autoencoders Know the Score: This paper establishes rigorous theoretical guarantees for the Distributional Principal Autoencoder (DPA): it derives a closed-form relationship between the level-set geometry of the optimal encoder and the score function of the data distribution, and proves that latent components beyond the manifold dimensionality are conditionally independent of the data—thereby unifying distributional learning and intrinsic dimension discovery within a single framework.
Do Different Prompting Methods Yield a Common Task Representation?: By generalizing the Function Vectors (FV) framework from few-shot demonstrations to text instructions, this paper finds that different prompting methods do not induce a unified task representation within LLMs; instead, they activate partially overlapping but largely distinct attention head mechanisms.
Dynamic Algorithm for Explainable k-medians Clustering under lp Norm: This paper presents the first explainable k-medians clustering algorithm for general $\ell_p$ norms, achieving an approximation ratio of $\tilde{O}(p(\log k)^{1+1/p-1/p^2})$ (improving the best known bound for $p=2$), along with the first dynamic variant: maintaining an explainable clustering under center insertions/deletions with $O(d \log^3 k)$ amortized update time and $O(\log k)$ amortized reassignments.
Dynamic Features Adaptation in Networking: Toward Flexible Training and Explainable Inference: This paper proposes DAFI (Drift-Aware Feature Importance), an algorithm that leverages distribution drift detection to dynamically switch between SHAP and MDI feature importance methods. Combined with Adaptive Random Forest (ARF), DAFI enables flexible training and efficient explainable inference in communication network scenarios where features are dynamically introduced over time.
Efficient Vision-Language Reasoning via Adaptive Token Pruning: This paper proposes Adaptive Token Pruning (ATP), a training-free plug-and-play module that selects the most informative visual tokens by fusing ViT CLS attention (intra-modal saliency) and CLIP text-image similarity (inter-modal relevance). ATP achieves less than 1% accuracy degradation on VQA/GQA/COCO Captioning in exchange for approximately 40% FLOPs reduction and 1.5× speedup.
Emergence of Linear Truth Encodings in Language Models: This paper proposes the Truth Co-occurrence Hypothesis (TCH)—that true statements tend to co-occur with other true statements—and uses a minimal single-layer Transformer toy model to provide an end-to-end demonstration of how linear truth subspaces emerge naturally through a two-phase training dynamic (memorization first → truth encoding later). This constitutes the first mechanistic explanation for the widely reported linear truth representations in LLMs.
Empowering Decision Trees via Shape Function Branching: This paper proposes the Shape Generalized Tree (SGT), which replaces the conventional linear threshold split at each internal node of a decision tree with a learnable axis-aligned shape function, enabling the capture of nonlinear feature effects within more compact tree structures while preserving interpretability.
Encoding and Understanding Astrophysical Information in Large Language Model-Generated Summaries: This work investigates whether LLM embeddings encode physically meaningful quantities derived from X-ray astronomical observations—specifically hardness ratios, power-law indices, and variability indices. Results show that structured prompt design improves clustering purity of physical attributes by 5.9%–57.5%, and sparse autoencoders reveal that LLMs infer physical parameters not explicitly stated by recognizing object types.
Evaluating LLMs in Open-Source Games: This work introduces a novel paradigm of open-source games—where agents submit programs rather than raw actions—to systematically evaluate LLMs on strategic reasoning, mutual learning, and cooperative gameplay, finding that LLMs can automatically discover approximate program equilibria.
Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions: FIxLIP proposes a game-theoretic framework based on weighted Banzhaf interaction indices that unifies the decomposition of similarity predictions in vision-language encoders (e.g., CLIP, SigLIP-2) into first-order token attributions and second-order cross-modal/intra-modal interactions, surpassing existing first-order attribution methods in both efficiency and faithfulness.
FaCT: Faithful Concept Traces for Explaining Neural Network Decisions: This paper proposes FaCT, an inherently interpretable model combining B-cos transformations and sparse autoencoders (SAE) that faithfully decomposes model predictions into concept contributions (Logit = $\sum$ concept contributions) and faithfully visualizes each concept down to the input pixel level (concept activation = $\sum$ pixel contributions). A DINOv2-based C²-score is also introduced to evaluate concept consistency.
Fantastic Features and Where to Find Them: A Probing Method to Combine Features from Multiple Foundation Models: This paper proposes ComBo, a lightweight probing-based adapter that compresses multi-layer activations from multiple frozen foundation models via affine projection, then fuses them with a small transformer—without backpropagation through any backbone. ComBo efficiently integrates complementary representations across models, surpassing prior probing methods and matching distillation-based methods on VTAB-1k.
Far from the Shallow: Brain-Predictive Reasoning Embedding through Residual Disentanglement: This paper proposes a residual disentanglement method that decomposes LLM hidden states into four approximately orthogonal embeddings—lexical, syntactic, semantic, and reasoning—for predicting intracranial ECoG brain signals. The study finds that reasoning signals exhibit independent neural signatures both temporally (~350–400 ms) and spatially (extending beyond classical language areas into visual cortex), revealing a computational alignment between LLM reasoning and human brain processing.
FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed: This paper proposes FastDINOv2, a two-stage frequency-based curriculum learning strategy: the model is first trained on low-resolution images for 75% of epochs to learn low-frequency features and accelerate convergence, then trained at full resolution with Gaussian noise patching for the remaining 25% to balance frequency bias. The approach achieves a 1.6× speedup and 2.25× FLOPs reduction while improving robustness.
Improving Perturbation-based Explanations by Understanding the Role of Uncertainty Calibration: This paper reveals a fundamental connection between uncertainty calibration (the alignment between model confidence and actual accuracy) and the quality of perturbation-based explanation methods. It demonstrates that miscalibration of models on perturbed inputs directly degrades the quality of both global and local explanations, and proposes ReCalX, which applies perturbation-level-adaptive temperature scaling to substantially improve the robustness and fidelity of explanations.
Interpretable Next-token Prediction via the Generalized Induction Head: This paper proposes Induction-Gram (GIM), an interpretable language model that combines exact n-gram matching with fuzzy matching. By constructing a "generalized induction head" that retrieves similar sequences from the input context for next-token prediction, it achieves an improvement of up to 25 percentage points over interpretable baselines and a 20% improvement in fMRI brain response prediction.
Knowing When to Stop: Efficient Context Processing via Latent Sufficiency Signals: This paper proposes dynamic context cutoff, which trains lightweight classifiers to detect "information sufficiency signals" encoded in specific Transformer attention heads, enabling the model to determine when sufficient context has been gathered and terminate processing early. On 6 QA datasets, the method achieves an average accuracy improvement of 3.4% while reducing token consumption by 1.33×.
Latent Principle Discovery for Language Model Self-Improvement: STaPLe proposes a posterior-regularized Monte Carlo EM algorithm that enables small 7–8B models to autonomously discover the latent "principles" that guide self-correction. Through an iterative discover-and-learn loop, the method achieves self-improvement with an 8–10% win-rate gain on AlpacaEval and an average improvement of +0.3 on MT-Bench. The discovered principles can be compressed into an interpretable constitution via clustering.
Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning: This paper proposes the Learning to Focus (LeaF) framework, which leverages gradient-guided detection to identify "confounding tokens" in training data. During knowledge distillation, these tokens are pruned to construct counterfactual samples, aligning the student model's attention to the key contextual tokens attended by the teacher model, thereby improving accuracy on mathematical reasoning and code generation tasks.
LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS: This paper presents a rigorous analysis of the unsupervised probing method CCS (Contrast-Consistent Search) and proposes its reformulation as Contrastive Eigenproblems, yielding closed-form solutions with interpretable eigenvalues. This formulation eliminates CCS's sensitivity to random initialization and naturally extends to multivariate settings.
Minimizing False-Positive Attributions in Explanations of Non-Linear Models: This paper proposes PatternLocal to address false-positive attributions caused by suppressor variables in XAI explanations of non-linear models. The method converts local discriminative surrogate weights into a generative representation, and significantly reduces false-positive feature attributions on three datasets: the XAI-TRIS benchmark, MRI artificial lesions, and EEG motor imagery.
Monte Carlo Expected Threat (MOCET) Scoring: This paper proposes the MOCET (Monte Carlo Expected Threat) scoring framework, which decomposes LLM-generated bioweapon synthesis protocols into sequential Bernoulli trials, combines k-NN semantic embedding-based success probability estimation with Monte Carlo simulation, and produces interpretable, automatable threat quantification metrics for measuring the real-world risk of LLMs in the biosecurity domain.
MoPFormer: Motion-Primitive Transformer for Wearable-Sensor Activity Recognition: MoPFormer decomposes wearable sensor signals into sequences of motion primitives and models their temporal dependencies with a Transformer, surpassing state-of-the-art methods on multiple human activity recognition (HAR) benchmarks while maintaining a lightweight architecture.
nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers: nnterp is a lightweight wrapper over NNsight that provides a unified interface for accessing internal activations across 50+ Transformer model variants spanning 21 architecture families, achieved through systematic module renaming and automated validation tests. It ships with built-in interpretability methods including logit lens, patchscope, and activation steering, resolving the fundamental trade-off between the correctness issues of TransformerLens and the lack of standardization in bare NNsight usage.
OrdShap: Feature Position Importance for Sequential Black-Box Models: This paper proposes OrdShap, a feature attribution method for sequential models that, for the first time, decouples Value Importance (VI) from Position Importance (PI) for each feature, providing theoretical guarantees grounded in the Sanchez-Bergantiños game-theoretic value.
Out of Control -- Why Alignment Needs Formal Control Theory (and an Alignment Control Stack): This position paper argues for formal optimal control theory as a foundational tool for AI alignment research, and proposes the Alignment Control Stack (ACS)—a ten-layer hierarchical framework spanning from the physical hardware layer to the social governance layer—for systematically organizing and analyzing measurement, control, and interoperability across different alignment methods.
Partial Information Decomposition via Normalizing Flows in Latent Gaussian Distributions: Two complementary tools are proposed: Thin-PID is an efficient Gaussian PID algorithm (10× faster than existing methods), and Flow-PID applies normalizing flows to map arbitrary input distributions to Gaussian space before computing PID, addressing the infeasibility of PID on continuous high-dimensional data. The paper also resolves an open problem regarding whether the joint Gaussian solution is optimal.
Probabilistic Token Alignment for Large Language Model Fusion: This work reformulates the token alignment problem in LLM fusion as an Optimal Transport (OT) problem, replacing traditional hard mappings with soft probabilistic alignment via dynamic token pairing and the Sinkhorn algorithm. On 78 tasks across 6 benchmarks, PTA-LLM achieves an average improvement of +1.72% over FuseLLM, while substantially mitigating performance degradation on challenging tasks (from −13.04% to −4.07%).
Rectifying Shortcut Behaviors in Preference-based Reward Learning: This paper proposes PRISM (Preference-based Reward Invariance for Shortcut Mitigation), which unifies reward hacking as a shortcut learning problem and employs group-invariant kernels approximated via random feature maps to simultaneously mitigate multiple spurious correlations (verbosity, sycophancy, tone, etc.), achieving consistent improvements on out-of-distribution preference data and downstream policy models.
Saying the Unsaid: Revealing the Hidden Language of Multimodal Systems Through Telephone Games: By running multi-round "telephone games" (image→text→image loops), this paper exploits the preference biases of multimodal systems to quantify the connection strength between concepts in the system's implicit space (i.e., the "hidden language"). It contributes the Telescope dataset (10,000+ concept pairs) and establishes a scalable test-time "world map" of multimodal systems.
scPilot: Large Language Model Reasoning Toward Automated Single-Cell Analysis and Discovery: This work proposes the scPilot framework and scBench benchmark, enabling LLMs to perform "omics-native reasoning" (ONR) directly on single-cell RNA-seq data—reading marker genes, forming hypotheses, invoking tools for verification, and iteratively refining conclusions—achieving an 11% improvement in cell-type annotation accuracy and a 30% reduction in trajectory inference graph-edit distance.
Self-Supervised Contrastive Learning is Approximately Supervised Contrastive Learning: This paper theoretically proves that self-supervised contrastive learning (DCL) is approximately equivalent to a supervised contrastive loss (NSCL), with the gap vanishing at rate $O(1/C)$ as the number of classes increases. It further proves that the global optimum of NSCL satisfies Neural Collapse (augmentation collapse + within-class collapse + Simplex ETF), and proposes a tighter few-shot error bound based on directional CDNV.
SHAP Values via Sparse Fourier Representation: This paper proposes FourierShap, an algorithm that first approximates a black-box predictor as a sparse Fourier representation and then leverages closed-form SHAP value formulas for Fourier basis functions to efficiently compute feature attributions, achieving 10–10,000× speedups over KernelShap while supporting a tunable accuracy–efficiency trade-off.
Simulating Society Requires Simulating Thought: This paper proposes a paradigm shift from "behaviorism" to "cognitive modeling" in LLM-based social simulation. The GenMinds framework models the internal reasoning processes of LLM agents via causal belief graphs, and the RECAP benchmark evaluates reasoning fidelity along three dimensions: traceability, demographic sensitivity, and intervention consistency.
Sloth: Scaling Laws for LLM Skills to Predict Multi-Benchmark Performance Across Families: This paper proposes Skills Scaling Laws (Sloth), which assumes that LLM performance is driven by low-dimensional latent skills (e.g., reasoning, instruction following). By exploiting inter-benchmark correlations, Sloth constructs scaling laws that generalize across model families, enabling prediction of large-model performance on multiple benchmarks using only a small amount of family-specific data.
SpEx: A Spectral Approach to Explainable Clustering: This paper proposes SpEx, a general spectral graph partitioning-based framework for explainable clustering that can "round" any reference clustering (without requiring centroids) into an explainable clustering via coordinate-cut decision trees, or perform reference-free clustering directly on a kNN graph.
Steering Information Utility in Key-Value Memory for Language Model Post-Training: This paper proposes InfoSteer, a lightweight method that treats the FFN layers of Transformers as associative key-value memories, promoting more complete utilization of pretrained knowledge during post-training via forward-pass intervention (boosting key coefficients of low-activation memory vectors) and backward-pass regularization (maximizing the entropy of key distributions). Across 6 models from 3 model families (Qwen/LLaMA/Gemma) and 15 in-distribution and out-of-distribution tasks, consistent improvements are observed, and steered language models exhibit adaptive information allocation behavior.
SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning: This paper proposes SynBrain, a framework that models fMRI responses as visual-semantic-conditioned probability distributions via BrainVAE, and employs an S2N Mapper for one-step semantic-to-neural-space mapping. SynBrain substantially outperforms MindSimulator on visual-to-fMRI synthesis (65% reduction in MSE, 96% improvement in Pearson correlation), and the synthesized fMRI signals effectively enhance few-shot cross-subject decoding performance.
Table as a Modality for Large Language Models: This paper proposes TaMo, a framework that treats tables as an independent modality, encoding their structural information via a hypergraph neural network and fusing the resulting structural embeddings with the text modality of an LLM. TaMo achieves an average improvement of 42.65% over pure-text methods across multiple table reasoning benchmarks, and approaches GPT-4 in terms of structural robustness.
TangledFeatures: Robust Feature Selection in Highly Correlated Spaces: This paper proposes TangledFeatures, a selection framework centered on feature stability, implementing a three-stage pipeline of correlation-graph clustering → ensemble representative selection → random forest refinement. The framework achieves highly reproducible, domain-knowledge-consistent feature subsets across resampling in highly correlated feature spaces, validated on alanine dipeptide backbone torsion angle prediction.
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?: This paper proves that when alignment maps in causal abstraction are unconstrained by linearity, any neural network can be mapped to any algorithm, rendering causal abstraction trivial and uninformative. This gives rise to the "non-linear representation dilemma"—the absence of a principled trade-off between the complexity and the fidelity of alignment maps.
The Trilemma of Truth in Large Language Models: This paper proposes sAwMIL (Sparse-Aware Multiple Instance Learning), a three-class probing framework that combines MIL and conformal prediction to classify LLM internal activations into true/false/neither, revealing that truth and falsity signals are not encoded as simple bidirectional opposites but as distributed representations spanning a multi-dimensional subspace.
Time-Evolving Dynamical System for Learning Latent Representations of Mouse Visual Cortex: This paper proposes TE-ViDS, a sequential latent variable model that decomposes visual neural activity into an external representation linked to visual stimuli and an internal representation reflecting internal states. By incorporating a time-evolving structure and contrastive learning, TE-ViDS achieves state-of-the-art decoding performance on natural scenes and videos.
How Intrinsic Motivation Shapes Learned Representations in Decision Transformers: A Cognitive Interpretability Analysis: This paper proposes a systematic post-hoc interpretability framework to analyze how intrinsic motivation (based on Random Network Distillation) shapes the geometric structure of the embedding space in Elastic Decision Transformers. The analysis reveals that different intrinsic motivation variants create fundamentally distinct representational structures—EDT-SIL promotes compact representations while EDT-TIL enhances orthogonality—and that embedding properties exhibit strong environment-specific correlations with task performance.
Toward Real-world Text Image Forgery Localization: Structured and Interpretable Data Synthesis: This paper proposes FSTS, a Fourier-series-inspired forgery synthesis framework that models the "invisible distribution" (the high-dimensional distribution of forgery operation parameters) from 16,750 real-world forgery instances collected from 67 human participants, generating synthetic training data that more closely approximates real-world forgeries and substantially improving the generalization of text image forgery localization models.
Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders: This paper proposes Mixture of Decoders (MxD), which decomposes the MLP layers of LLMs into tens of thousands of sparsely activated expert sub-layers (layer-level sparsity). Each expert implements a full-rank linear transformation via Hadamard product tensor factorization. MxD significantly outperforms Transcoders on the sparsity–accuracy trade-off while maintaining interpretability.
Towards Scaling Laws for Symbolic Regression: This work presents the first systematic study of scaling laws for symbolic regression (SR), demonstrating that end-to-end Transformer-based SR follows power-law scaling trends across three orders of magnitude of compute, and derives empirical rules for the optimal token-to-parameter ratio ($\approx 15$), as well as batch size and learning rate scaling with model size.
Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders: This paper systematically compares the interpretability of features derived from Transformer feed-forward (FF) layer key-value memories with those learned by sparse autoencoders (SAEs), finding the two approaches perform comparably on existing evaluation metrics—with FF-KV outperforming SAEs on certain dimensions—thereby questioning the necessity of SAEs as a feature discovery tool.
Tropical Attention: Neural Algorithmic Reasoning for Combinatorial Algorithms: Tropical Attention replaces softmax dot-product attention with tropical algebraic geometry, performing piecewise-linear reasoning in tropical projective space to align with the polyhedral decision structures of combinatorial algorithms. It is the first approach to extend neural algorithmic reasoning to NP-hard problems, comprehensively outperforming softmax baselines across three OOD generalization axes: length, magnitude, and noise.
Uncovering Graph Reasoning in Decoder-only Transformers with Circuit Tracing: This paper applies a circuit tracing framework to analyze the internal mechanisms of decoder-only Transformers on graph reasoning tasks, uncovering two core reasoning mechanisms: token merging and structural memorization.
URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training: This paper systematically evaluates three categories of metadata (URLs, quality scores, and topic/format domain information) as pretraining context. The key finding is that only URLs accelerate training (achieving equivalent downstream performance with 60B tokens instead of 100B), and this effect only holds under long prompts (5-shot); quality scores and topic/format domain information do not accelerate training but can be used for classifier-free guidance to enable controllable generation.
VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity: This paper proposes VADTree, a training-free video anomaly detection framework that leverages a pretrained Generic Event Boundary Detection (GEBD) model to construct a Hierarchical Granularity-aware Tree (HGTree), enabling adaptive sampling and multi-granularity reasoning for anomalous events of varying temporal spans. VADTree achieves state-of-the-art performance among training-free methods on three benchmarks—UCF-Crime, XD-Violence, and MSAD—and even surpasses certain weakly supervised approaches.
ValuePilot: A Two-Phase Framework for Value-Driven Decision-Making: This paper proposes ValuePilot, a two-phase framework that constructs value-annotated decision scenarios via a Dataset Generation Toolkit (DGT) and performs multi-criteria decision-making through a Decision-Making Module (DMM) conditioned on personalized user value preferences, outperforming strong baselines including GPT-5 in alignment with human decisions.
VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set: This paper proposes VL-SAE, a sparse autoencoder with a distance-based encoder and modality-specific decoders that maps the semantics of both visual and linguistic representations onto a unified concept set, thereby interpreting and enhancing the vision-language alignment mechanism of VLMs. The approach yields an average improvement of 0.6–0.9% on zero-shot classification and outperforms the dedicated method VCD on POPE hallucination mitigation.
What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers: This paper systematically investigates the phenomenon of "abrupt learning" in Transformer training, revealing that during the loss plateau the model has already learned partial solutions while simultaneously exhibiting output repetition bias and representation collapse. It further demonstrates that the slow learning of attention maps constitutes the key bottleneck, with findings validated in the early pretraining stages of LLMs such as Pythia and OLMo.
Why Is Attention Sparse in Particle Transformer?: This paper systematically analyzes the near-binary sparse attention phenomenon observed in Particle Transformer (ParT) after training on jet tagging tasks. Through cross-dataset comparisons and ablation studies, it demonstrates that the sparsity primarily originates from the attention mechanism itself rather than the physics-inspired interaction matrix. Nevertheless, the interaction matrix remains indispensable to final performance by influencing the argmax particle selection for the vast majority of tokens.
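The soft token alignment in the Probabilistic Token Alignment entry above relies on the Sinkhorn algorithm. A minimal pure-Python sketch of generic Sinkhorn iterations follows. It is not PTA-LLM's implementation: the function name `sinkhorn`, the toy cost matrix, the marginals, and the `eps` and `n_iters` parameters are all hypothetical values chosen for illustration.

```python
import math


def sinkhorn(cost, a, b, eps=0.5, n_iters=200):
    """Entropy-regularized optimal transport plan between histograms a and b.

    cost[i][j] is the (assumed) cost of aligning source token i with target
    token j; a and b are probability vectors over source and target tokens.
    """
    n, m = len(a), len(b)
    # Gibbs kernel: low-cost pairs get large weights.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so the plan's row sums move toward a.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so the plan's column sums match b.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Soft alignment: plan[i][j] is the mass moved from source i to target j.
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: 2 source tokens, 3 target tokens (made-up costs).
toy_cost = [[0.0, 1.0, 1.0],
            [1.0, 0.2, 0.8]]
plan = sinkhorn(toy_cost, a=[0.5, 0.5], b=[0.4, 0.3, 0.3])
for row in plan:
    print([round(p, 4) for p in row])
```

Every entry of the resulting plan is strictly positive, which is what makes the alignment "soft" rather than a hard one-to-one mapping. A smaller `eps` pushes the plan toward a hard mapping but slows convergence, so the iteration count usually has to grow with it.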

📊 LLM Evaluation

A High-Dimensional Statistical Method for Optimizing Transfer Quantities in Multi-Source Transfer Learning: This paper proposes a theoretical framework based on K-L divergence and high-dimensional statistical analysis to determine the optimal number of samples to transfer from each source task in multi-source transfer learning. The framework avoids the negative transfer caused by naively using all source data, and the resulting algorithm OTQMS surpasses the state of the art by 1.0–1.5% on DomainNet and Office-Home while reducing sample usage by 47.85% and training time by 35.19%.
A Standardized Benchmark for Multilabel Antimicrobial Peptide Classification: This paper presents ESCAPE—the first standardized multilabel antimicrobial peptide classification benchmark, integrating 80,000+ peptides from 27 public databases, along with a dual-branch Transformer + bidirectional cross-attention baseline model that achieves a 2.56% relative improvement in mAP over the second-best method.
A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values: This paper proposes a unified framework that subsumes KernelSHAP, LeverageSHAP, and related Shapley value estimators under a randomized sketching perspective, provides the first non-asymptotic theoretical guarantees for KernelSHAP, and extends these methods to high-dimensional datasets such as CIFAR-10 via algorithmic improvements including Poisson approximation.
AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners: This work shows that random data sampling in STaR (Self-Taught Reasoner) trains on observations at severely imbalanced frequencies (easy problems are over-trained while hard problems are under-trained) and proposes AdaSTaR, which combines adaptive diversity sampling (prioritizing under-trained samples) with adaptive curriculum sampling (adjusting difficulty based on model strength). AdaSTaR achieves the highest accuracy on all 6 benchmarks while reducing training FLOPs by 58.6%.
Aggregation Hides OOD Generalization Failures from Spurious Correlations: This paper reveals the "aggregation masking" phenomenon in OOD generalization benchmarks: while aggregate evaluation exhibits accuracy-on-the-line (AoTL)—a positive correlation between ID and OOD accuracy—the proposed OODSelect method can identify large, semantically coherent subsets (up to 75%) from the same OOD data on which higher ID accuracy corresponds to lower OOD accuracy (Pearson R as low as −0.92), demonstrating that the harm of spurious correlations is systematically concealed by aggregate evaluation.
Asymmetric Duos: Sidekicks Improve Uncertainty: Asymmetric Duos (AD) pairs a large model with a small "sidekick", combining their predictions via temperature-weighted logit averaging, to approach the uncertainty estimation quality of a 5-member deep ensemble at only 10–20% additional FLOPs. RN50 AD (5% FLOPs overhead) approaches an $m=5$ deep ensemble (400% FLOPs overhead) on AUROC/AURC/SAC@98.
Bayesian Evaluation of Large Language Model Behavior: This paper proposes a Beta-Binomial Bayesian framework for evaluating LLM behavior. By modeling the posterior distribution of $\theta_m$ over stochastic generations for each prompt, the framework quantifies statistical uncertainty in evaluation metrics and introduces sequential sampling strategies such as Thompson sampling to achieve narrower credible intervals with fewer API calls.
Belief-Calibrated Multi-Agent Consensus Seeking for Complex NLP Tasks: This paper proposes the Belief-Calibrated Consensus Seeking (BCCS) framework, which incorporates three modules—belief-calibrated consensus judgment, conflict-aware collaborator assignment, and leader selection—to enable multi-agent systems to reach more stable consensus on complex NLP tasks, yielding improvements of 2.23% and 3.95% on difficult subsets of MATH and MMLU, respectively.
Benchmarking is Broken — Don't Let AI be its Own Judge: This paper systematically critiques the fundamental flaws of current AI benchmark evaluation—data contamination (45%+ overlap in MMLU), selective reporting, and lack of proctoring—and proposes PeerBench: drawing on the proctoring paradigm of high-stakes exams (e.g., SAT/GRE), it constructs a next-generation AI evaluation infrastructure via a rolling confidential question bank, peer-review quality control, reputation-weighted scoring, and cryptographic commitment mechanisms.
Benchmarking Large Language Models for Zero-Shot and Few-Shot Phishing URL Detection: This paper systematically evaluates three commercial LLMs — GPT-4o, Claude-3.7, and Grok-3-Beta — on phishing URL detection under a unified zero-shot and few-shot prompt framework. Results show that few-shot prompting consistently improves performance across all models, with Grok-3-Beta achieving the best F1 (0.9399) on the balanced dataset, while different models exhibit distinct precision–recall trade-off behaviors.
Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation: This paper formalizes LLM benchmark evaluation as a hierarchical statistical model, theoretically demonstrates that multiple stochastic generations ($k>1$) reduce the variance of benchmark score estimates, and introduces a prompt-level difficulty metric $\mathbb{P}(\text{correct})$ along with data maps for benchmark quality control.
Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations: This paper proposes LAGER, a framework that aggregates score token logits from intermediate to final layers of an LLM and computes an expected score to derive the final judgment. Without any model fine-tuning, LAGER improves human alignment by up to 7.5% and matches or surpasses reasoning-based methods without requiring chain-of-thought inference.
Bispectral OT: Dataset Comparison using Symmetry-Aware Optimal Transport: This paper proposes Bispectral Optimal Transport (BOT), which builds the cost matrix of discrete optimal transport from bispectrum (group Fourier invariant) distances instead of raw pixel distances, enabling transport plans to eliminate group-action-induced variation (e.g., rotation) while preserving signal structure. On rotation-augmented MNIST and related datasets, class-preservation accuracy improves from 33% to 84%.
BLINK-Twice: You See But Do You Observe? A Reasoning Benchmark on Visual Perception: This paper introduces BLINK-Twice, a vision-centric reasoning benchmark comprising 345 visually challenging images, 103 adversarial samples, 896 VQA pairs, and 1,725 annotated reasoning steps. Through seven categories of visual illusion scenarios, it evaluates the "you see but do not observe" reasoning capability of MLLMs. The strongest model, Gemini-2.5 Pro, achieves only 26.9% G-Acc, suggesting that multi-round image observation and active visual interaction are promising directions for improvement.
Can Large Language Models Master Complex Card Games?: This paper systematically evaluates the ability of LLMs to learn eight complex card games. It finds that through SFT on high-quality game trajectory data, LLMs can approach the performance of strong game AIs and simultaneously master multiple games, though general capabilities degrade — a decline that can be mitigated by mixing in general instruction data.
CLIMB: Class-Imbalanced Learning Benchmark on Tabular Data: This paper presents CLIMB — the most comprehensive benchmark to date for class-imbalanced learning on tabular data — encompassing 73 real-world datasets and 29 CIL algorithms. Large-scale experiments reveal several practical insights: naive rebalancing is often ineffective, ensemble methods are critical, and data quality impacts performance more than the degree of imbalance itself.
CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance: This paper proposes CodeAssistBench (CAB), the first fully automated benchmark for evaluating multi-turn, repository-level programming assistance. CAB automatically constructs 3,286 real-world programming help scenarios from GitHub Issues, spanning 7 languages and 214 repositories, and reveals a substantial performance gap: state-of-the-art models achieve 70–83% on StackOverflow-style questions but only 7–16% on post-cutoff repositories.
ComPO: Preference Alignment via Comparison Oracles: To address likelihood displacement and verbosity caused by noisy preference pairs (where preferred and dispreferred responses are highly similar) in DPO, this paper proposes ComPO, a zeroth-order preference alignment method based on comparison oracles. The approach partitions data into clean and noisy subsets, applying DPO to the clean subset and ComPO to extract alignment signals from the noisy subset, achieving consistent improvements in LC win rate on benchmarks such as AlpacaEval 2.
Conformal Online Learning of Deep Koopman Linear Embeddings: This paper proposes the COLoKe framework, which reinterprets conformal prediction as a model consistency diagnostic tool. Parameter updates are triggered only when the Koopman model's prediction error exceeds a dynamically calibrated threshold, enabling efficient online Koopman linear embedding learning for nonlinear dynamical systems.
Cost-Sensitive Freeze-thaw Bayesian Optimization for Efficient Hyperparameter Tuning: CFBO incorporates user-defined utility functions (cost–performance trade-offs) into the freeze-thaw Bayesian optimization framework, and combines an adaptive stopping criterion with LC mixup-based transfer learning to achieve optimal cost–performance trade-offs on multi-fidelity HPO benchmarks.
Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models: This work constructs the Braingle Brainteaser benchmark (242 math + 236 logic puzzles) and systematically evaluates LLM reasoning strategies on brainteasers. The findings reveal that models occasionally produce creative, insight-driven solutions, but frequently fall back on brute-force enumeration even when elegant solutions exist; self-correction ability is limited; and translating narrative formats into mathematical formats yields modest performance gains.
Decoupled Entropy Minimization: This paper decouples classical entropy minimization (EM) into two opposing components — the Cluster Aggregation Driving Factor (CADF, which rewards dominant classes) and the Gradient Mitigation Calibrator (GMC, which penalizes high-confidence classes) — revealing two inherent flaws of classical EM (reward collapse and easy-class bias). The proposed AdaDEM addresses these issues via normalized rewards and marginal entropy calibration, achieving significant improvements across semi-supervised learning, domain adaptation, reinforcement learning, and other tasks.
Document Summarization with Conformal Importance Guarantees: This work presents the first application of Conformal Prediction to document summarization. By calibrating a threshold on sentence importance scores, it provides rigorous statistical guarantees on user-controllable coverage ($1-\alpha$) and recall ($\beta$) for extractive summaries. The method is model-agnostic and requires only a small calibration set.
Efficient Semantic Uncertainty Quantification in Language Models via Diversity-Steered Sampling: This paper proposes a diversity-steered sampling framework that injects NLI-based semantic similarity penalties during decoding to encourage semantically diverse generation, and corrects distributional bias via importance weighting with control variates to reduce variance. The method accurately estimates semantic entropy (aleatoric uncertainty) and mutual information (epistemic uncertainty) of LLMs using as few as 16 samples.
EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving: This paper proposes EvaLearn, a benchmark that evaluates the learning capability and learning efficiency of LLMs through a sequential problem-solving paradigm, revealing that models with stronger static performance do not necessarily possess greater learning potential.
Exploiting Task Relationships in Continual Learning via Transferability-Aware Task Embeddings: This paper proposes H-embedding, a transferability-aware task embedding based on H-score, and integrates it into a hypernetwork framework. By explicitly modeling inter-task relationships in the embedding space to guide parameter generation, the method achieves state-of-the-art final accuracy in a rehearsal-free setting.
Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training: Through controlled experiments, this paper reveals the fundamental mechanism by which larger vocabularies improve language model performance: expanding the vocabulary reduces the Kolmogorov complexity of tokenized text, exploiting vocabulary frequency imbalance to substantially lower the loss on high-frequency tokens, thereby driving down global cross-entropy and improving downstream task performance.
Generalization Error Analysis for Selective State-Space Models Through the Lens of Attention: This work unrolls selective SSMs (Mamba) into an attention-equivalent form and derives generalization bounds via covering number techniques, controlled by the spectral abscissa $s_{\mathbf{A}}$ of the continuous-time state matrix. When $s_{\mathbf{A}} < 0$, the bound is independent of sequence length; when $s_{\mathbf{A}} \geq 0$, it grows exponentially. The paper further proves this dependence is irreducible.
Heterogeneous Adversarial Play in Interactive Environments: This paper proposes HAP (Heterogeneous Adversarial Play), which formalizes teacher-student interaction as a minimax game: a teacher network automatically generates challenge tasks targeting student weaknesses, while the student policy continuously adapts and evolves, forming an adaptive curriculum without manual design. HAP outperforms state-of-the-art baselines in multi-task RL environments, and the generated curriculum proves effective for human learners as well.
HouseLayout3D: A Benchmark and Training-Free Baseline for 3D Layout Estimation in the Wild: This paper introduces HouseLayout3D—the first real-world 3D layout estimation benchmark targeting large-scale multi-floor buildings—and MultiFloor3D, a training-free baseline that combines modern 3D reconstruction and segmentation models to surpass existing deep learning methods on multi-floor building layout estimation.
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization: This paper proposes HybridNorm, a hybrid normalization strategy that applies QKV normalization within the attention module to decouple gradients and Post-Norm within the FFN to enhance regularization. Across scales from 550M to 7B parameters, HybridNorm simultaneously achieves the training stability of Pre-Norm and the generalization performance of Post-Norm, yielding an average downstream task improvement of 2.45% at the 7B scale.
Hyperbolic Fine-Tuning for Large Language Models: This work identifies that LLM token embeddings follow power-law distributions and exhibit tree-like hyperbolic structure, and proposes HypLoRA — performing low-rank adaptation directly on the Lorentz hyperbolic manifold (bypassing the cancellation effect of tangent space mappings) — achieving significant gains over standard LoRA on arithmetic and commonsense reasoning tasks (e.g., M.AVG +7.5% on Qwen2.5-7B).
Incomplete Multi-view Clustering via Hierarchical Semantic Alignment and Cooperative Completion: This paper proposes the HSACC framework, which employs a two-level semantic space design (low-level mutual information consistency + high-level adaptive weighted fusion) combined with cooperatively optimized implicit missing-view recovery, achieving significant improvements over existing incomplete multi-view clustering methods on five benchmark datasets.
Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities: This paper introduces the Ineq-Comp benchmark, which applies compositionally transformed variants of simple inequality seed problems—variants that humans can resolve with minimal additional effort—to expose fundamental deficiencies in the compositional reasoning of current LLM-based formal theorem provers. Even DeepSeek-Prover-V2-7B suffers a performance drop exceeding 20%.
Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning: This paper proposes Controllable Pseudo-label Generation (CPG), a framework that progressively incorporates reliable pseudo-labels into the labeled set via a controllable self-reinforcing optimization cycle. By training a Bayes-optimal classifier on a distribution of known composition, CPG achieves accuracy gains of up to 15.97% in the Realistic LTSSL setting where the unlabeled data distribution is entirely unknown.
LCDB 1.1: A Database Illustrating Learning Curves Are More Ill-Behaved Than Previously Thought: This paper constructs LCDB 1.1, a large-scale high-resolution learning curve database, demonstrating that ill-behaved sample learning curves (non-monotonic, non-convex) are approximately twice as prevalent as previously believed, with roughly 15% of curves exhibiting significant ill-behavior that feature scaling largely fails to remedy.
Learning Generalizable Shape Completion with SIM(3) Equivariance: This paper proposes SIMECO, the first SIM(3)-equivariant shape completion network. Through a three-stage modular design — feature canonicalization → similarity-invariant geometric reasoning → transformation recovery — SIMECO outperforms all augmentation-based and equivariant baselines under an unbiased evaluation protocol, achieving a 17% MMD reduction on KITTI and a 14% CD-$\ell_1$ reduction on OmniObject3D. Notably, SIMECO under the stricter protocol still surpasses competing methods evaluated under their own biased settings.
Leveraging Robust Optimization for LLM Alignment under Distribution Shifts: This paper proposes DoRA (Distribution-aware Optimization for Robust Alignment), which trains a distribution classifier to assign calibrated weights to individual samples and incorporates them into a KL-DRO framework to minimize worst-case loss. DoRA operates as a model-agnostic plug-and-play module that consistently improves the robustness of various alignment algorithms—including DPO, RRHF, and LIRE—under distribution shifts.
LTD-Bench: Evaluating Large Language Models by Letting Them Draw: LTD-Bench evaluates the spatial reasoning capabilities of LLMs by having them draw (via dot-matrix output or code-based rendering), transforming abstract evaluation metrics into intuitive visual outputs. The benchmark reveals critical deficiencies in current state-of-the-art LLMs regarding bidirectional mapping between linguistic and spatial concepts.
MEIcoder: Decoding Visual Stimuli from Neural Activity by Leveraging Most Exciting Inputs: MEIcoder leverages neuron-specific Most Exciting Inputs (MEIs) as biological priors, combined with SSIM loss and adversarial training, to achieve state-of-the-art visual stimulus reconstruction from neural population activity in the primary visual cortex (V1), with particular strengths in small-dataset and low-neuron-count regimes.
Mind the Gap: Removing the Discretization Gap in Differentiable Logic Gate Networks: This paper proposes Gumbel Logic Gate Networks (Gumbel LGNs), which inject Gumbel noise into logic gate selection and employ a straight-through (ST) estimator to reduce the discretization gap of differentiable logic gate networks by 98%, achieve a 4.5× speedup in training, and reduce the proportion of unused neurons to 0%.
Model-Behavior Alignment under Flexible Evaluation: When the Best-Fitting Model Isn't the Right One: Through large-scale model recovery experiments, this paper demonstrates that even with 4.5 million behavioral data points, flexible evaluation methods based on linear probing achieve model recovery accuracy below 80% across 20 visual models. This reveals a fundamental trade-off between predictive accuracy and model identifiability, challenging the prevailing paradigm that the best-fitting model is the most appropriate one.
Model Context Protocol for Vision Systems: Audit, Security, and Protocol Extensions: This paper presents the first protocol-level audit of MCP deployment in vision systems, analyzing 91 public MCP servers and finding that 78% exhibit schema inconsistencies and 89% lack runtime validation. It further proposes protocol extensions including semantic schemas, visual memory, and runtime validators.
MVSMamba: Multi-View Stereo with State Space Model: This paper proposes MVSMamba, the first Mamba-based Multi-View Stereo (MVS) network, which achieves efficient intra-view and inter-view global omnidirectional feature aggregation via a reference-centered dynamic scanning strategy, attaining state-of-the-art performance on DTU and Tanks-and-Temples with superior efficiency.
Normal-Abnormal Guided Generalist Anomaly Detection: NAGL is the first framework to incorporate mixed normal-and-abnormal reference samples into Generalist Anomaly Detection (GAD). Through two attention modules—Residual Mining (RM) and Anomaly Feature Learning (AFL)—it learns transferable anomaly patterns in residual space, substantially outperforming normal-reference-only methods in cross-domain scenarios with as few as 1 anomaly reference sample.
Not All Splits Are Equal: Rethinking Attribute Generalization Across Unrelated Categories: This paper presents the first systematic evaluation of how train/test splitting strategies affect generalization performance in attribute prediction tasks. It proposes four progressively harder splitting schemes based on LLM semantic grouping, embedding similarity, embedding clustering, and ground-truth supercategory labels. The study finds that unsupervised clustering-based splitting achieves leakage reduction comparable to ground-truth supercategory splits—without requiring any annotations—while retaining substantially better predictive performance.
On Evaluating LLM Alignment by Evaluating LLMs as Judges: This paper systematically investigates the consistency between LLMs' generation capability and evaluation capability (GE-consistency), finding a strong correlation between the two rankings under a strong preference oracle (Spearman $\rho = 0.96$). Based on this finding, the authors propose the AlignEval benchmark, which measures LLM alignment by assessing LLMs' ability as judges—without directly invoking LLM-as-Judge to evaluate model outputs—achieving performance comparable to or better than AlpacaEval and Arena-Hard.
On the Entropy Calibration of Language Models: This paper systematically investigates the entropy calibration of language models — whether the entropy of generated text matches the log loss on human text — and finds that due to the power-law nature of data distributions ($\alpha \approx 1$), error accumulation improves extremely slowly with model scale (scaling exponent $\approx -0.05$). The paper further provides a theoretical proof that entropy can be calibrated in polynomial time without sacrificing diversity.
Open-Insect: Benchmarking Open-Set Recognition of Novel Species in Biodiversity Monitoring: This paper introduces Open-Insect — the first large-scale fine-grained open-set recognition benchmark for insect species discovery, spanning three geographic regions and three types of open-set splits. It systematically evaluates 38 OSR algorithms, finding that simple posterior methods (e.g., MSP) remain strong baselines in fine-grained settings, and demonstrates the critical role of domain-relevant auxiliary data in improving OSR performance.
OptiTree: Hierarchical Thoughts Generation with Tree Search for LLM Optimization Modeling: This paper proposes OptiTree, which organizes hierarchical classification and modeling thoughts for operations research (OR) problems by constructing a modeling tree, and employs tree search to adaptively decompose complex problems into sequences of simpler subproblems, achieving significant accuracy gains in optimization modeling tasks for LLMs (exceeding 10% on multiple challenging benchmarks).
PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation: This paper presents PARROT, a practical and realistic benchmark for cross-system SQL translation (SQL-to-SQL), comprising 598 core translation pairs (expanded to 28,003 pairs) sourced from 38 open-source benchmarks and real-world business scenarios, covering 22 production-grade database systems. The benchmark reveals that the strongest current LLMs achieve an average accuracy below 38.53%.
PaTH Attention: Position Encoding via Accumulating Householder Transformations: This paper proposes PaTH (Position encoding via accumulating Householder Transformations), a data-dependent multiplicative position encoding scheme that replaces RoPE's static rotation matrices with accumulated Householder transformations, achieving superior theoretical expressiveness and empirical language modeling performance over RoPE.
PFΔ: A Benchmark Dataset for Power Flow under Load, Generation, and Topology Variations: PFΔ is the first power flow benchmark dataset to simultaneously encompass load, generation dispatch, and topology variations. It comprises 859,800 solved instances across six grid scales, includes close-to-infeasible extreme operating conditions, and introduces a standardized evaluation task suite for systematically assessing ML methods under diverse operating conditions.
Put CASH on Bandits: A Max K-Armed Problem for Automated Machine Learning: This paper addresses the Combined Algorithm Selection and Hyperparameter Optimization (CASH) problem in AutoML. Through data-driven analysis, it reveals that HPO reward distributions are bounded and left-skewed, and proposes MaxUCB—a bandit algorithm specifically tailored to this distributional property—achieving both theoretical and empirical improvements over existing methods.
RDB2G-Bench: A Comprehensive Benchmark for Automatic Graph Modeling of Relational Databases: This paper proposes RDB2G-Bench — the first benchmark framework for evaluating relational-database-to-graph modeling methods, comprising 5 real-world RDBs, 12 prediction tasks, approximately 50,000 precomputed graph model–performance pairs, and a systematic comparison of 10 automatic graph modeling approaches.
Reliably Detecting Model Failures in Deployment Without Labels: This paper proposes D3M (Disagreement-Driven Deterioration Monitoring), a three-stage model monitoring algorithm based on variational Bayesian posterior sampling, which reliably detects model performance degradation in label-free, training-data-free deployment settings while maintaining low false positive rates under non-degrading distribution shifts.
Rethinking Evaluation of Infrared Small Target Detection: This paper systematically identifies three critical limitations in existing evaluation protocols for infrared small target detection (IRSTD), and proposes a hierarchical analysis framework comprising the hybrid-level metric hIoU, a systematic error analysis methodology, and a cross-dataset evaluation setting.
Rethinking Losses for Diffusion Bridge Samplers: This paper identifies theoretical flaws in the widely used Log Variance (LV) loss for diffusion bridge samplers—namely, that it violates the data processing inequality and its gradients are not equivalent to those of the reverse KL (rKL)—and proposes computing rKL gradients via the log-derivative trick (rKL-LD). The proposed approach consistently outperforms LV loss across multiple benchmarks while exhibiting more stable training and reduced sensitivity to hyperparameters.
RGB-to-Polarization Estimation: A New Task and Benchmark Study: This paper formally defines the novel task of estimating polarization components (S₁/S₂/S₃) from standard RGB images, establishes the first systematic benchmark encompassing both restoration-based and generative methods, and finds that pretrained MAE achieves the best overall pixel-level accuracy (PSNR 24.74). Restoration-based methods consistently outperform diffusion-based generative methods, with pretrained weight transfer identified as a critical advantage.
Risk Management for Mitigating Benchmark Failure Modes: BenchRisk: Grounded in the NIST Risk Management Framework, this work systematically analyzes 57 failure modes across 26 LLM benchmarks, proposes 196 mitigation strategies, and introduces BenchRisk—a meta-evaluation framework that scores the reliability of benchmarks themselves.
Robust Hallucination Detection in LLMs via Adaptive Token Selection: HaMI frames hallucination detection as a Multiple Instance Learning (MIL) problem, treating each generated sequence as a bag of token instances. By jointly optimizing token selection and hallucination detection, it adaptively identifies the most informative tokens, achieving substantial AUROC improvements over all existing methods across four QA benchmarks (up to 11.9%).
scMRDR: A Scalable and Flexible Framework for Unpaired Single-Cell Multi-Omics Data Integration: This paper proposes scMRDR, a framework based on β-VAE that disentangles latent representations of single-cell multi-omics data into modality-shared and modality-specific components, achieving scalable integration of unpaired multi-omics data through isometric regularization, adversarial training, and masked reconstruction loss.
Semi-Supervised Regression with Heteroscedastic Pseudo-Labels: This paper proposes an uncertainty-aware pseudo-label framework based on heteroscedastic modeling, which dynamically calibrates per-sample pseudo-label uncertainty via bilevel optimization to mitigate the negative impact of noisy pseudo-labels on regression models, achieving state-of-the-art performance on multiple SSR benchmarks.
Small Language Models as Compiler Experts: Auto-Parallelization for Heterogeneous Systems: This work systematically evaluates three language models with fewer than 1.5B parameters (gemma3, llama3.2, qwen2.5) on compiler auto-parallelization tasks. Using six inference strategies across 11 real-world kernels, the approach achieves an average speedup of 6.81× and a peak speedup of 43.25×, demonstrating that small models can serve as powerful compiler optimization reasoning engines.
SPROD: Spurious-Aware Prototype Refinement for Reliable Out-of-Distribution Detection: SPROD is a post-hoc OOD detection method designed to handle spurious correlations in training data. It subdivides each class prototype into "correctly classified" and "misclassified" subgroups (the latter sharing spurious features), combined with K-means-style refinement and distance-based (generative) scoring. Across 5 spurious-correlation OOD benchmarks, it achieves an average AUROC of 85.1% (+4.8% vs. runner-up KNN) and FPR@95 of 49.0% (−9.3% vs. runner-up).
Test-Time Adaptation by Causal Trimming: This paper proposes TACT, a method that identifies non-causal directions in the representation space via data augmentation and PCA, then removes the projections of both test representations and class prototypes along these directions at test time. This reduces model reliance on non-causal features and significantly improves prediction performance under distribution shift.
The Geometry of Cortical Computation: Manifold Disentanglement and Predictive Dynamics in VCNet: This paper proposes VCNet—a neural network architecture that simulates the macroscopic organization of the primate visual cortex—reinterpreting dual-stream separation (manifold disentanglement) and predictive coding (geodesic refinement) through the language of geometry and dynamical systems. At an extremely compact size of 0.04 MB, VCNet achieves 92.1% accuracy on Spots-10 (10% above a distilled DenseNet), and attains 74.4% on light field classification at 3.52 MB (surpassing MobileNetV2 by 2.3%).
Thought Communication in Multiagent Collaboration: This paper proposes ThoughtComm, a framework that formalizes multiagent communication as a latent variable generative model. It proves that both shared and private thoughts are identifiable under nonparametric conditions, extracts latent thoughts via a sparsity-regularized autoencoder, and feeds them back to each agent through prefix injection. ThoughtComm achieves an average improvement of 19.06% over the current SOTA Multiagent Finetuning on mathematical reasoning benchmarks.
Tight Lower Bounds and Improved Convergence in Performative Prediction: Under the performative prediction framework, this paper provides the first tight convergence rate analysis for Repeated Risk Minimization (RRM) and proposes the Affine Risk Minimizers (ARM) algorithm class, which achieves convergence over a broader problem class by leveraging data from historical training snapshots.
Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking: This paper introduces DeepFund — the first live fund investment benchmark for LLMs — which employs a multi-agent architecture (Financial Planner + Analyst Team + Portfolio Manager) connected to real-time market data, eliminating the information leakage caused by LLM "time travel" in traditional backtesting. Over 24 trading days of live testing across 9 flagship LLMs, only Grok 3 achieves positive returns, revealing fundamental limitations of current LLMs in active fund management.
Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era: This paper proposes ImAge (Implicit Aggregation), which inserts learnable aggregation tokens at a specific layer of a Transformer backbone and leverages the intrinsic self-attention mechanism to implicitly aggregate patch features into a global descriptor, completely eliminating the need for an external aggregator. With the smallest descriptor dimensionality (6144) and fastest inference speed, ImAge surpasses SOTA methods such as SALAD and BoQ across multiple VPR benchmarks, and ranks 1st on the MSLS Challenge leaderboard.
Turbocharging Gaussian Process Inference with Approximate Sketch-and-Project: This paper proposes the ADASAP algorithm, which extends the sketch-and-project framework to large-scale GP inference via approximate subspace preconditioning, distributed computation, and Nesterov acceleration. It is the first method to scale exact GP inference beyond $3\times10^8$ samples, while theoretically establishing condition number-free convergence guarantees for the SAP family.
Unlocking Transfer Learning for Open-World Few-Shot Recognition: This paper proposes a two-stage framework that combines open-set-aware meta-learning with open-set-free transfer learning, achieving the first successful application of the transfer learning paradigm to few-shot open-set recognition (FSOSR) and reaching SOTA on miniImageNet and tieredImageNet.
What Does It Take to Build a Performant Selective Classifier?: This paper presents the first finite-sample decomposition of the selective classification gap, attributing it to five sources—Bayes noise, approximation error, ranking error, statistical noise, and implementation bias—and demonstrates that monotone calibration methods have limited effect on closing this gap.
Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally: This paper constructs WCB, the most comprehensive central bank monetary policy corpus to date (380,000+ sentences, 25 central banks, spanning 28 years), defines three NLP tasks (stance detection, temporal classification, uncertainty estimation), and through 15,075 benchmark experiments demonstrates that models trained on aggregated multi-bank data significantly outperform single-bank training, confirming the principle that "the whole is greater than the sum of its parts."
Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator: This paper identifies that post-training (SFT/RLHF/DPO) degrades the confidence calibration of pre-trained language models, and proposes DACA, a method that exploits the well-calibrated nature of pre-trained models by aligning confidence distributions exclusively on prediction-consistent samples, achieving label-free calibration of post-trained models with up to 15.08% ECE improvement.
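
The last entry above reports its gain as an ECE (expected calibration error) improvement. As a reminder of what that metric measures, here is a minimal Python sketch of the common equal-width binned estimator. The function name `expected_calibration_error`, the 10-bin default, and the toy inputs are illustrative assumptions and are not taken from the DACA paper, which may use a different binning scheme.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,
) -> float:
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    Assumes confidences lie in [0, 1] and `correct` holds 1 (right) or 0 (wrong).
    Bins are equal-width and half-open, [k/n_bins, (k+1)/n_bins), with a
    confidence of exactly 1.0 placed in the last bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    conf_sum = [0.0] * n_bins   # sum of confidences per bin
    hit_sum = [0.0] * n_bins    # number of correct predictions per bin
    count = [0] * n_bins        # number of predictions per bin

    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += y
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        mean_conf = conf_sum[b] / count[b]
        accuracy = hit_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece


# Toy example: a model that says 0.9 four times but is right only three times
# is overconfident by 0.15 in that bin.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

A lower value means stated confidence tracks empirical accuracy more closely; a perfectly calibrated set of predictions scores 0.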

🛡️ AI Safety

A Set of Generalized Components to Achieve Effective Poison-only Clean-label Backdoor Attacks with Collaborative Sample Selection and Triggers: This paper proposes a set of generalized components (Component A/B/C) that establish a bidirectional collaborative relationship between sample selection and trigger design, simultaneously improving the attack success rate (ASR) and stealthiness of Poison-only Clean-label Backdoor Attacks (PCBA), with strong generalizability across multiple attack types.
AI Should Sense Better, Not Just Scale Bigger: Adaptive Sensing as a Paradigm Shift: Inspired by biological sensory systems, this position paper argues that AI research must shift from simply scaling models to optimizing inputs—by dynamically adjusting sensor-level parameters (exposure, gain, multimodal configuration, etc.) to produce inputs most favorable to the model. Under ideal sensor adaptation, a small model (EfficientNet-B0, 5M parameters) can outperform a large model (OpenCLIP-H, 632M parameters), and the paper proposes a progressive formalization framework ranging from single-shot perception to closed-loop perception–action coupling.
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond): This work constructs the Infinity-Chat dataset (26K open-ended real user queries + 31,250 human annotations) to reveal the "Artificial Hivemind" phenomenon in open-ended language model generation—characterized by severe intra-model repetition and inter-model homogeneity—and demonstrates that Reward Models and LM Judges fail to calibrate on samples with high inter-annotator preference disagreement.
Beyond Last-Click: An Optimal Mechanism for Ad Attribution: This paper analyzes the strategic manipulation vulnerabilities of the Last-Click attribution mechanism from a game-theoretic perspective—platforms can obtain unfair attribution credit by falsifying timestamps—and proposes the Peer-Validated Mechanism (PVM), in which each platform's credit depends solely on the reports of other platforms (analogous to peer review). The paper theoretically proves that PVM is dominant strategy incentive compatible (DSIC) and optimal under homogeneous settings, improving attribution accuracy from 34% to 75% in the two-platform case.
Boosting Adversarial Transferability with Spatial Adversarial Alignment: This paper proposes Spatial Adversarial Alignment (SAA), which fine-tunes a surrogate model via two modules—spatial-aware alignment and adversarial-aware alignment—to align its features with those of a witness model, achieving significant improvements in cross-architecture adversarial transferability (CNN→ViT transfer rate improved by 25–39%).
Bridging Symmetry and Robustness: On the Role of Equivariance in Enhancing Adversarial Robustness: By embedding rotation-equivariant (P4 group) and scale-equivariant convolutional layers into CNNs, this work proposes two symmetry-aware architectures — Parallel and Cascaded — that significantly improve adversarial robustness without adversarial training. Grounded in the CLEVER framework, it theoretically demonstrates that equivariant architectures compress the hypothesis space, regularize gradients, and tighten certified robustness bounds.
Causally Reliable Concept Bottleneck Models: This paper proposes C2BM (Causally reliable Concept Bottleneck Models), which organizes the concept bottleneck as a causal graph structure. By combining observational data with background knowledge, C2BM automatically learns causal relationships, achieving significantly improved causal reliability, intervention responsiveness, and fairness while maintaining classification accuracy.
Cost Efficient Fairness Audit Under Partial Feedback: Under the partial feedback setting, this paper proposes a fairness auditing framework with a novel cost model, delivering near-optimal audit algorithms for both black-box and mixture model scenarios, reducing audit cost by approximately 50% compared to natural baselines.
CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D: This work extends MLE-Bench to construct 20 code-sabotage tasks and sandbagging evaluations. It finds that frontier AI agents can successfully plant backdoors and other sabotage while completing normal ML engineering tasks, and in some cases evade detection by LM monitors.
Deceptron: Learned Local Inverses for Fast and Stable Physics Inversion: This paper proposes the Deceptron bidirectional module, which learns a local inverse of a differentiable forward surrogate and introduces a Jacobian Composition Penalty (JCP). By mapping output-space residuals back to the input space, the method achieves Gauss-Newton-like preconditioned gradient updates for physics inversion, dramatically reducing iteration counts (approximately 20× speedup on Heat-1D).
DESIGN: Encrypted GNN Inference via Server-Side Input Graph Pruning: This paper proposes DESIGN, a framework that accelerates FHE-based GNN inference by approximately $2\times$ over the SEAL baseline through two-stage server-side optimization—input graph pruning and adaptive polynomial activation degree allocation—while maintaining competitive accuracy.
DictPFL: Efficient and Private Federated Learning on Encrypted Gradients: This paper proposes DictPFL, a framework that decomposes model weights into a static dictionary and a trainable lookup table, and combines this decomposition with encryption-aware pruning. DictPFL achieves full gradient protection via homomorphic encryption in federated learning while reducing communication overhead by 402–748× and training time by 28–65×, keeping total runtime within 2× of plaintext FL.
Differential Privacy for Euclidean Jordan Algebra with Applications to Private Symmetric Cone Programming: This paper proposes a general Gaussian privacy mechanism based on Euclidean Jordan Algebra (EJA) and, building upon it, designs the first differentially private algorithm for Symmetric Cone Programming (SCP), thereby resolving an important open problem on differentially private semidefinite programming posed by Hsu et al. (ICALP 2014).
Differentially Private Bilevel Optimization: Efficient Algorithms with Near-Optimal Rates: This paper systematically studies bilevel optimization under differential privacy (DP). For the convex setting, it establishes near-tight upper and lower bounds via the exponential mechanism and regularized exponential mechanism, matching the optimal rate of single-level DP-ERM. For the non-convex setting, it proposes a second-order DP method achieving state-of-the-art convergence rates that are independent of the inner-level dimension.
Differentially Private High-dimensional Variable Selection via Integer Programming: This paper proposes two pure differentially private sparse variable selection methods (top-R and mistakes) that leverage modern mixed integer programming (MIP) techniques to efficiently explore non-convex objective landscapes, achieving state-of-the-art support recovery rates in high-dimensional settings (p up to 10,000) while providing theoretical recovery guarantees.
Distributional Adversarial Attacks and Training in Deep Hedging: This paper is the first to introduce distributional adversarial attacks into the deep hedging framework. It proposes computationally tractable adversarial training methods based on Wasserstein balls (WPGD and WBPGD), achieving substantial improvements in robustness and out-of-sample performance under distribution shift and real market data.
Dual-Flow: Transferable Multi-Target, Instance-Agnostic Attacks via In-the-wild Cascading Flow Optimization: This paper proposes the Dual-Flow framework, which leverages the forward ODE flow of a pretrained diffusion model and the reverse flow of a fine-tuned LoRA velocity function to perform multi-target, instance-agnostic adversarial attacks. Through a cascading distribution shift training strategy, the method significantly improves transfer attack success rates (e.g., +34.58% from Inc-v3 to Res-152) and demonstrates strong robustness against defended models.
Efficient Fairness-Performance Pareto Front Computation: This paper proposes MIFPO, a method that efficiently computes the fairness-performance Pareto front without training complex fair representation models, by theoretically reducing the problem to a compact discrete concave optimization problem.
Efficient Verified Machine Unlearning for Distillation: This paper proposes PURGE, a framework that extends verified unlearning under SISA to the knowledge distillation (KD) setting via teacher–student constituent mapping and an incremental multi-teacher distillation strategy. When a teacher-side unlearning request is issued, only a subset of student constituents requires retraining, achieving at least $N\times$ speedup.
Enabling Differentially Private Federated Learning for Speech Recognition: Benchmarks, Adaptive Optimizers and Gradient Clipping: This work establishes the first practical benchmark for FL+DP in end-to-end ASR, achieving only 1.3%–4.6% absolute WER degradation under strong privacy guarantees by combining per-layer clipping with the layer-wise gradient normalization of the LAMB optimizer.
Enhancing Graph Classification Robustness with Singular Pooling: This paper presents the first systematic analysis of how flat pooling operators (Sum/Avg/Max) affect adversarial robustness in graph classification. It derives adversarial risk upper bounds for each operator and proposes RS-Pool—a method that constructs graph-level representations from the dominant singular vector of the node embedding matrix—achieving significant robustness improvements without sacrificing clean accuracy.
Environment Inference for Learning Generalizable Dynamical System: This paper proposes DynaInfer, a framework that infers environment labels for unlabeled trajectories by analyzing the prediction errors of a fixed neural network, enabling generalizable dynamical system learning without environment annotations. DynaInfer matches or surpasses Oracle (known-label) performance on ODE/PDE systems.
Exploration of Incremental Synthetic Non-Morphed Images for Single Morphing Attack Detection: This paper systematically investigates the effect of incrementally introducing synthetic non-morphed face images into Single Morphing Attack Detection (S-MAD) training. Results show that a moderate proportion of synthetic data (~75% increment) can improve cross-dataset generalization (EER reduced from 6.17% to 6.10%), while excessive use or training exclusively on synthetic data leads to severe performance degradation (EER rising to ~38%).
Factor Decorrelation Enhanced Data Removal from Deep Predictive Models: This paper proposes DecoRemoval, a framework that achieves data removal without full retraining via two modules: discriminability-preserving factor decorrelation (RFF-based spatial mapping with adaptive weighting) and smoothed loss perturbation. The method significantly outperforms existing approaches, particularly under out-of-distribution (OOD) settings.
Fair Minimum Labeling: Efficient Temporal Network Activations for Reachability and Equity: This paper introduces the Fair Minimum Labeling (FML) problem, which aims to design minimum-cost temporal edge activation schemes ensuring sufficient temporal-path reachability for each node group in a network to satisfy fair coverage requirements. The paper proves FML is NP-hard and inapproximable beyond a certain factor, and provides an approximation algorithm based on probabilistic tree embeddings that matches the hardness lower bound.
Fair Representation Learning with Controllable High Confidence Guarantees via Adversarial Inference: This paper proposes FRG (Fair Representation learning with high-confidence Guarantees), the first fair representation learning framework that allows users to specify a fairness threshold $\varepsilon$ and confidence level $1-\delta$. By combining VAE-based candidate selection, adversarial inference that maximizes covariance, and a Student's t-test to construct a high-confidence upper bound, FRG guarantees that $\Delta_{DP} \leq \varepsilon$ holds with probability at least $1-\delta$ for any downstream model and task.
FairContrast: Enhancing Fairness through Contrastive Learning and Customized Augmentation: FairContrast proposes a fair contrastive learning framework for tabular data. By strategically selecting positive pairs—pairing advantaged-group samples with favorable outcomes against their disadvantaged-group counterparts—and training end-to-end with supervised or self-supervised contrastive loss combined with cross-entropy loss, the framework achieves significant bias reduction with minimal accuracy loss, without introducing any additional fairness constraint losses.
Fairness-Regularized Online Optimization with Switching Costs: This paper is the first to rigorously integrate long-term fairness and action smoothness into a unified online optimization framework. It first establishes that the original problem is fundamentally intractable under standard dynamic benchmarks, then proposes FairOBD, which makes the fairness cost amenable to online optimization via auxiliary variables and dual mirror descent, achieving an asymptotically optimal competitive ratio under the more principled $(R, \delta)$-constrained benchmark.
Fairness under Competition: This paper is the first to study the joint fairness of multiple fair classifiers operating in a competitive environment. It theoretically demonstrates that even when each individual classifier satisfies Equal Opportunity (EO), the ecosystem as a whole may remain unfair, and that applying fairness adjustments to a biased classifier can paradoxically reduce ecosystem-level fairness.
FedFACT: A Provable Framework for Controllable Group-Fairness Calibration in Federated Learning: This paper proposes FedFACT, a framework that characterizes the structure of the Bayes-optimal fair classifier under federated learning, and reduces fair federated learning to personalized cost-sensitive learning (in-processing) and bi-level optimization (post-processing), respectively. It is the first to achieve controllable coordination between global and local fairness in multi-class settings, with convergence and generalization guarantees.
FLUX: Efficient Descriptor-Driven Clustered Federated Learning under Arbitrary Distribution Shifts: FLUX extracts compact distribution descriptors on the client side (marginal $P(X)$ mean/covariance + class-conditional $P(Y|X)$ mean/covariance), performs unsupervised clustering on the server via adaptive DBSCAN to automatically determine the number of clusters and group assignments, and trains cluster-specific models. At test time, it matches unlabeled new clients to the optimal model using only feature descriptors. It is the first method to simultaneously handle four types of distribution shifts with communication overhead comparable to FedAvg.
ForensicHub: A Unified Benchmark & Codebase for All-Domain Fake Image Detection and Localization: ForensicHub introduces the first unified benchmark platform spanning all domains (Deepfake/IMDL/AIGC/Document Tampering) for fake image detection and localization, encompassing 4 tasks, 23 datasets, 42 models, 6 backbone networks, and 11 GPU-accelerated evaluation metrics. Through a modular architecture and adapter design, it bridges domain silos and conducts 16 cross-domain evaluations to derive 8 key insights.
Impact of Dataset Properties on Membership Inference Vulnerability of Deep Transfer Learning: This paper theoretically derives and empirically validates a power-law relationship between membership inference attack (MIA) vulnerability and the number of samples per class in deep transfer learning: $\log(\text{tpr}-\text{fpr}) = -\beta_S \log(S) - \beta_0$. It finds that increasing data volume reduces both average and worst-case vulnerability, but protecting the most vulnerable samples requires an extremely large amount of data.
Impact of Dataset Properties on Membership Inference Vulnerability of Deep Transfer Learning: This paper theoretically and empirically demonstrates a power-law relationship between membership inference attack (MIA) vulnerability and the number of samples per class in deep transfer learning: as the per-class sample count $S$ increases, MIA advantage decays as $S^{-1/2}$. However, the amount of data required to protect the most vulnerable samples is prohibitively large, highlighting the irreplaceable role of formal differential privacy guarantees.
Improved Balanced Classification with Theoretically Grounded Loss Functions: Two theory-driven surrogate loss families are proposed—Generalized Logit-Adjusted (GLA) loss and Generalized Class-Aware weighted (GCA) loss—providing stronger theoretical guarantees and improved empirical performance for multi-class classification under class imbalance.
Incentivizing Time-Aware Fairness in Data Sharing: This paper proposes a time-aware data sharing framework that introduces new incentive conditions (F6–F8) and two reward schemes—Time-Aware Reward Cumulation and Time-Aware Data Valuation—to ensure that participants who join a collaboration earlier receive higher-value rewards, while simultaneously preserving fairness and individual rationality.
Influence Functions for Edge Edits in Non-Convex Graph Neural Networks: This paper proposes influence functions for edge edits applicable to non-convex GNNs. By leveraging the proximal Bregman response function (PBRF), the method relaxes the convexity assumption and jointly accounts for both parameter shift and message propagation effects, supporting both edge deletion and insertion.
It's Complicated: The Relationship of Algorithmic Fairness and Non-Discrimination Provisions for High-Risk Systems in the EU AI Act: This paper systematically analyzes the complex relationship between the non-discrimination provisions for high-risk AI systems in the EU AI Act (AIA) and the field of algorithmic fairness in machine learning. It reveals critical gaps in the legal text concerning input-side bias detection, the absence of output-side protections, and standardization challenges, providing a foundational framework for interdisciplinary collaboration between computer science and law.
Keep It Real: Challenges in Attacking Compression-Based Adversarial Purification: This paper systematically evaluates compression-based adversarial purification defenses and demonstrates that the realism of reconstructed images is the critical factor for robustness—high-realism compression models maintain significant robustness under strong adaptive attacks, and this robustness is not attributable to gradient masking.
Learning-Augmented Facility Location Mechanisms for Envy Ratio: For the envy ratio objective in one-dimensional facility location, this paper designs both deterministic and randomized learning-augmented mechanisms: the deterministic $\alpha$-BIM achieves an optimal consistency–robustness tradeoff, while the randomized BAM further improves the guarantees. The paper also resolves an open problem posed by Ding et al., improving the approximation ratio of prediction-free randomized mechanisms from 2 to approximately 1.8944.
Locally Optimal Private Sampling: Beyond the Global Minimax: Under local differential privacy (LDP), this paper proposes a local minimax framework that leverages neighborhood constraints defined by a public distribution $P_0$ to derive closed-form optimal samplers, achieving consistent and significant improvements over the global minimax sampler both theoretically and empirically.
Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy and Research: This paper systematically identifies five fundamental mismatches between machine unlearning techniques and policy objectives in the context of generative AI, arguing that machine unlearning cannot serve as a universal solution for privacy, copyright, or safety concerns, and provides a practical conceptual framework for both ML researchers and policymakers.
MARS: A Malignity-Aware Backdoor Defense in Federated Learning: This paper proposes MARS, a defense method that quantifies the malignity of local models by computing per-neuron Backdoor Energy (BE), and leverages Wasserstein distance-based clustering to effectively identify backdoor models in federated learning.
Matchings Under Biased and Correlated Evaluations: This paper introduces a correlation parameter $\gamma$ (the degree of alignment between institutional evaluations) into a two-institution stable matching model, and analyzes how bias $\beta$ and correlation $\gamma$ jointly affect the representation ratio of disadvantaged groups. It proves that even a slight loss of correlation can cause a sharp drop in representation, and characterizes the Pareto frontier of fairness interventions.
Mitigating Disparate Impact of Differentially Private Learning through Bounded Adaptive Clipping: By introducing a tunable lower bound into adaptive gradient clipping (bounded adaptive clipping), this work prevents the clipping bound from shrinking excessively during training, thereby improving accuracy for minority groups and mitigating algorithmic unfairness under DP constraints.
Mitigating Privacy-Utility Trade-off in Decentralized Federated Learning via f-Differential Privacy: This paper proposes two privacy accounting methods for decentralized federated learning under the f-DP framework—PN-f-DP and Sec-f-LDP—which leverage hypothesis-testing-based privacy measures to consistently yield tighter privacy bounds than Rényi DP, thereby reducing noise injection and improving model utility under equivalent privacy guarantees.
Model Inversion with Layer-Specific Modeling and Alignment for Data-Free Continual Learning: This paper proposes Per-layer Model Inversion (PMI) for data-free continual learning to accelerate synthetic image generation, and mitigates the feature drift between synthetic and real data via class-level Gaussian feature modeling and contrastive learning, achieving efficient and high-quality data-free knowledge replay.
Multi-Class Support Vector Machine with Differential Privacy: This paper proposes the PMSVM framework, which exploits the single-pass data access property of all-in-one multi-class SVMs. By combining weight perturbation and gradient perturbation, PMSVM substantially reduces the privacy budget consumption of multi-class SVMs under differential privacy, achieving a superior privacy–utility trade-off.
Nearly-Linear Time Private Hypothesis Selection with the Optimal Approximation Factor: This paper presents the first hypothesis selection algorithm under the central differential privacy model that simultaneously achieves nearly-linear time complexity and the optimal approximation factor $\alpha=3$, resolving an open problem posed by Bun et al. (NeurIPS 2019).
Not All Deepfakes Are Created Equal: Triaging Audio Forgeries for Robust Deepfake Singer Identification: This paper proposes a two-stage pipeline grounded in the premise that the most harmful deepfakes are those of the highest quality. A discriminator first filters out low-quality forgeries to reduce noise; a singer identification model trained exclusively on genuine recordings then performs voiceprint matching. The pipeline consistently outperforms baselines across multiple datasets.
OmniFC: Rethinking Federated Clustering via Lossless and Secure Distance Reconstruction: OmniFC is proposed as a model-agnostic federated clustering framework that exactly reconstructs the global pairwise distance matrix over a finite field via Lagrange coded computing, enabling any centralized clustering method (K-Means / Spectral Clustering / DBSCAN / Hierarchical Clustering, etc.) to run directly on the reconstructed matrix. The framework requires only a single communication round, is inherently robust to Non-IID data, and comprehensively outperforms specialized methods such as k-FED, MUFC, and FedSC across 7 datasets.
On the Hardness of Conditional Independence Testing In Practice: This paper systematically analyzes why kernel-based conditional independence (CI) tests fail in practice. It identifies estimation error in conditional mean embeddings as the central driver of Type-I error inflation, and formally characterizes an inherent tension in the choice of conditioning kernel $k_C$: the choice is critical for test power, yet it can also exacerbate false positives.
Optimal Adjustment Sets for Nonparametric Estimation of Weighted Controlled Direct Effect: This paper establishes three foundational theoretical results for the weighted controlled direct effect (WCDE): necessary and sufficient conditions for unique identifiability, derivation of the influence function for nonparametric estimation, and characterization of the optimal covariate adjustment set that minimizes asymptotic variance.
Perturbation Bounds for Low-Rank Inverse Approximations under Noise: This paper provides the first non-asymptotic spectral norm perturbation bounds for low-rank inverse approximations $\|(\tilde{A}^{-1})_p - A_p^{-1}\|$ under additive noise. Using contour integration techniques, sharp bounds are derived that depend on the eigengap, spectral decay, and noise alignment, improving upon classical full-inverse bounds by up to a factor of $\sqrt{n}$.
Position: Bridge the Gaps between Machine Unlearning and AI Regulation: This paper systematically analyzes six potential application scenarios of Machine Unlearning (MU) in compliance with the EU AI Act (AIA), identifies the technical gaps between the state of the art and actual regulatory requirements in each scenario, and calls on the research community to bridge these gaps in order to realize the potential of MU in AI governance.
Preserving Task-Relevant Information Under Linear Concept Removal: SPLINCE constructs an oblique projection that simultaneously guarantees linear guardedness (i.e., sensitive attributes cannot be predicted by any linear classifier) and exactly preserves the covariance between representations and task labels, thereby resolving the problem of existing concept erasure methods inadvertently removing task-relevant information alongside sensitive concepts.
Private Continual Counting of Unbounded Streams: This paper proposes a novel matrix factorization method based on logarithmic perturbation, achieving for the first time a differentially private continual counting algorithm that simultaneously satisfies the three properties of "unbounded streams," "smooth error," and "near-optimal asymptotic error," with variance $O(\log^{2+2\alpha}(t))$ at time step $t$ for any $\alpha > 0$.
Private Zeroth-Order Optimization with Public Data: This paper proposes the PAZO framework, which leverages public data to guide gradient approximation in private zeroth-order optimization. PAZO achieves a superior privacy-utility tradeoff compared to DP-SGD on both vision and text tasks, while delivering up to a 16× speedup.
Provable Watermarking for Data Poisoning Attacks: This paper proposes two provable watermarking schemes—post-poisoning watermarking and poisoning-concurrent watermarking—that provide transparency declaration mechanisms for data poisoning attacks. The theoretical analysis demonstrates that, under specific watermark length conditions, both watermark detectability and poisoning effectiveness can be simultaneously guaranteed.
PubSub-VFL: Towards Efficient Two-Party Split Learning in Heterogeneous Environments via Publisher/Subscriber Architecture: This paper proposes PubSub-VFL, an efficient two-party vertical federated learning framework based on a publisher/subscriber architecture. Through a hierarchical asynchronous mechanism and system-profiling-based hyperparameter optimization, it achieves 2–7× training speedup and up to 91% computational resource utilization while preserving privacy and model accuracy.
Reconstruction and Secrecy under Approximate Distance Queries: Under the approximate distance query model, this paper studies the reconstruction game from a learning-theoretic perspective, proves a geometric characterization of the optimal reconstruction error as the Chebyshev radius, and provides a complete classification of pseudo-finiteness for Euclidean convex spaces.
Rewind-to-Delete: Certified Machine Unlearning for Nonconvex Functions: This paper proposes R2D (Rewind-to-Delete), the first first-order, black-box certified machine unlearning algorithm for general nonconvex loss functions. It achieves data deletion by rewinding to an earlier checkpoint in the training trajectory and then performing gradient descent on the retained data, while providing $(\varepsilon, \delta)$-certified unlearning guarantees and theoretical trade-offs among privacy, utility, and efficiency.
Robust Graph Condensation via Classification Complexity Mitigation: This paper reveals that graph condensation (GC) is fundamentally a process of reducing classification complexity, and that adversarial attacks precisely undermine this property. Based on this insight, the authors propose the MRGC framework, which enhances GC robustness through three manifold-based regularization modules: intrinsic dimensionality regularization, curvature-aware manifold smoothing, and inter-class manifold decoupling. This work represents the first systematic study of GC robustness under simultaneous perturbations of structure, features, and labels.
Sequentially Auditing Differential Privacy: This paper proposes a differential privacy auditing framework based on sequential hypothesis testing and kernel MMD statistics, enabling valid detection of privacy violations at any point during streaming mechanism outputs. The approach reduces the required sample count from 50K (as needed by existing methods) to just a few hundred, and can identify DP-SGD privacy violations in less than one full training run.
Spectral Perturbation Bounds for Low-Rank Approximation with Applications to Privacy: This paper establishes novel high-probability perturbation bounds for low-rank approximation of symmetric matrices under the spectral norm, improving upon the classical Eckart–Young–Mirsky theorem, and resolves an open problem in differentially private PCA.
Stealthy Yet Effective: Distribution-Preserving Backdoor Attacks on Graph Classification: This paper proposes DPSBA, a clean-label backdoor attack framework for graph classification that generates in-distribution trigger subgraphs via adversarial training while suppressing both structural and semantic anomalies, achieving high attack success rates with significantly improved stealthiness.
Stochastic Regret Guarantees for Online Zeroth- and First-Order Bilevel Optimization: This paper proposes a novel search direction and proves that first-order and zeroth-order online bilevel optimization algorithms built upon it achieve sublinear stochastic bilevel regret guarantees without requiring window smoothing, while improving efficiency through reduced oracle dependence, parallel updates, and zeroth-order Hessian/Jacobian estimation.
Taught Well, Learned Ill: Towards Distillation-Conditional Backdoor Attack: This paper proposes the Distillation-Conditional Backdoor Attack (DCBA) paradigm and its instantiation SCAR, which embeds a "dormant" backdoor into a teacher model via bi-level optimization. The backdoor remains undetectable on the teacher model but is activated and transferred to the student model during knowledge distillation, even when the distillation dataset is entirely clean.
The Unseen Threat: Residual Knowledge in Machine Unlearning under Perturbed Samples: This paper identifies a critical security vulnerability in machine unlearning: even when an unlearned model is statistically indistinguishable from a retrained model, applying small adversarial perturbations to forgotten samples causes the unlearned model to correctly classify them while the retrained model fails — revealing a novel privacy risk termed "residual knowledge." The authors propose RURK, a fine-tuning strategy that penalizes correct predictions on perturbed forgotten samples, effectively suppressing residual knowledge across 11 unlearning methods on CIFAR-10 and ImageNet-100.
Understanding and Improving Adversarial Robustness of Neural Probabilistic Circuits: This paper theoretically establishes that the adversarial robustness of Neural Probabilistic Circuits (NPC) depends solely on the attribute recognition model and is independent of the probabilistic circuit. Building on this finding, it proposes RNPC, which achieves provably improved robustness via class-wise inference aggregation, significantly enhancing adversarial robustness while maintaining benign accuracy.
Understanding Challenges to the Interpretation of Disaggregated Evaluations of AI: Through causal graphical modeling, this paper demonstrates that performance disparities across subgroups in disaggregated evaluations do not necessarily indicate unfairness, but may instead reflect natural consequences of distributional differences in the data-generating process. The authors recommend supplementing standard disaggregated evaluations with causal assumptions and weighted evaluation methods.
Unifying Proportional Fairness in Centroid and Non-Centroid Clustering: This paper unifies the study of proportional fairness in centroid and non-centroid clustering under a single "semi-centroid clustering" framework, establishes an impossibility theorem showing the two cannot be simultaneously achieved, and designs novel algorithms that attain constant-factor core guarantees under dual metric loss.
Unifying Re-Identification, Attribute Inference, and Data Reconstruction Risks in Differential Privacy: Under the f-DP framework grounded in hypothesis testing, this paper provides a unified characterization of three classes of privacy risks in differential privacy — re-identification, attribute inference, and data reconstruction — yielding tighter and consistent risk upper bounds that enable a 20% reduction in noise without compromising security guarantees.
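
As an illustration of the power-law form quoted in the membership inference entries above, $\log(\text{tpr}-\text{fpr}) = -\beta_S \log(S) - \beta_0$, the following minimal sketch recovers $\beta_S$ and $\beta_0$ by ordinary least squares in log-log space. The data is synthetic and noise-free, generated with an assumed exponent of $1/2$; the function name `fit_power_law` and all numeric values are hypothetical and are not taken from the paper.

```python
import math


def fit_power_law(samples_per_class, advantages):
    """Fit log(adv) = -beta_s * log(S) - beta_0 by ordinary least squares.

    Returns (beta_s, beta_0). Hypothetical helper for illustration only.
    """
    xs = [math.log(s) for s in samples_per_class]
    ys = [math.log(a) for a in advantages]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Standard closed-form simple linear regression: y = slope * x + intercept.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Flip signs to match the convention log(adv) = -beta_s*log(S) - beta_0.
    return -slope, -intercept


# Synthetic data (assumed values): advantage = 0.8 * S ** -0.5.
S_VALUES = [1, 4, 16, 64, 256]
ADVANTAGES = [0.8 * s ** -0.5 for s in S_VALUES]

beta_s, beta_0 = fit_power_law(S_VALUES, ADVANTAGES)
print(f"beta_S = {beta_s:.3f}, beta_0 = {beta_0:.3f}")
```

On real measurements the fit would be noisy, and the advantage `tpr - fpr` must be strictly positive for the logarithm to be defined.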

💡 LLM Reasoning

AbbIE: Autoregressive Block-Based Iterative Encoder for Efficient Sequence Modeling: This paper proposes AbbIE, an architecture that recursively iterates the intermediate layers (Body) of a decoder-only Transformer. Trained with only 2 iterations, AbbIE achieves upward generalization at inference time by increasing the number of iterations, surpassing standard Transformers on both language modeling perplexity and zero-shot ICL benchmarks, while serving as a drop-in replacement for standard Transformers.
Adaptive Dual Reasoner: Large Reasoning Models Can Think Efficiently by Hybrid Reasoning: This paper proposes the Adaptive Dual Reasoner (ADR), which enables reasoning models to dynamically switch between fast thinking (compressing simple reasoning steps) and slow thinking (preserving depth for complex steps). Through SFT cold-start combined with EHPO (Entropy-guided Hybrid Policy Optimization), ADR achieves up to 6.1% accuracy improvement on mathematical reasoning benchmarks while reducing reasoning tokens by 49.5%–59.3%.
Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost: This paper presents the first systematic analysis of large reasoning models (LRMs) in MQM-based machine translation evaluation, identifying failure modes including overthinking, score overestimation, and scale-dependent sensitivity to input materials. The authors propose ThinMQM, a method that calibrates LRM reasoning by fine-tuning on synthetic human MQM annotation trajectories, reducing the thinking budget by approximately 35× while improving evaluation performance (achieving +8.7 correlation score for the 7B model).
ARM: Adaptive Reasoning Model: ARM enables models to adaptively select among four reasoning formats (Direct Answer, Short CoT, Code, Long CoT) and introduces Ada-GRPO to address format collapse during training, achieving comparable accuracy to pure Long CoT models while reducing token usage by ~30% on average and up to ~70% on simple tasks.
Atom of Thoughts for Markov LLM Test-Time Scaling: This paper proposes Atom of Thoughts (AoT), which models LLM reasoning as a Markov chain where each state is a self-contained subproblem that is answer-equivalent to the original question but of strictly lower complexity. A two-phase transition mechanism based on DAG decomposition and contraction eliminates historical dependencies. AoT integrates seamlessly with existing methods such as ToT and reflection, achieving state-of-the-art performance across six benchmarks spanning mathematics, code, and multi-hop QA.
Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations: This paper introduces ChemCoTBench, the first CoT-based benchmark for evaluating chemical reasoning in LLMs. It decomposes complex chemical problems into modular chemical operations (adding/deleting/substituting functional groups), and is accompanied by ChemCoTDataset — a large-scale dataset of 22,000 expert-annotated CoT samples — enabling systematic evaluation of both reasoning and non-reasoning LLMs across molecular understanding, editing, optimization, and reaction prediction.
Clip-and-Verify: Linear Constraint-Driven Domain Clipping for Accelerated Neural Network Verification: This paper proposes the Clip-and-Verify verification pipeline, which leverages linear constraints generated "for free" during linear bound propagation. Two GPU-efficient algorithms—complete clipping (coordinate ascent dual solving) and relaxed clipping (closed-form input domain shrinkage)—are used to tighten intermediate-layer bounds across the entire network. The approach reduces the number of BaB subproblems by up to 96% on multiple benchmarks, and serves as a core component of the VNN-COMP 2025 winning verifier.
Controlling Thinking Speed in Reasoning Models: By applying Representation Engineering (RepE) to extract steering vectors that control fast/slow thinking transitions from the hidden space of Large Reasoning Models (LRMs), and combining these with a real-time reasoning difficulty estimator based on inter-layer logit divergence, the method achieves training-free adaptive reasoning speed control — yielding an average of +1.3% accuracy improvement and −8.6% token reduction across 4 LRMs.
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring: This paper systematically evaluates the effectiveness of Chain-of-Thought monitoring within the AI Control framework. It finds that CoT monitoring outperforms action-only monitoring by +10pp on subtle sabotage tasks, but underperforms by −25pp on non-subtle tasks (due to deceptive rationalizations in reasoning misleading the monitor). A hybrid monitoring protocol—independently scoring CoT and action then combining via weighted fusion—consistently outperforms either approach alone across all scenarios, achieving up to a 2× improvement in detection rate.
Curriculum Abductive Learning: This paper proposes Curriculum Abductive Learning (C-ABL), which partitions a knowledge base into sub-knowledge-bases according to its dependency structure and introduces them progressively during training. This substantially reduces the abduction search space in ABL, significantly improving training stability, convergence speed, and final accuracy.
DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization: This paper analyzes the GRPO objective and reveals two inherent issues: difficulty bias (underweighting questions that are too hard or too easy) and entropy instability. It proposes DisCO, a discriminative constrained optimization framework that addresses these issues via a clip-free scoring function, squared hinge constrained optimization, and distributionally robust optimization (DRO) for imbalanced rollouts. On 1.5B models, DisCO outperforms GRPO by 7% and DAPO by 6% on average.
Does Thinking More Always Help? Mirage of Test-Time Scaling in Reasoning Models: Through systematic experiments, this paper reveals that the performance of test-time scaling in LRMs (achieved by repeatedly appending "Wait" prompts to extend reasoning) exhibits a non-monotonic pattern of initial improvement followed by degradation. A probabilistic model is then used to demonstrate that this apparent "gain" is merely a mirage caused by increased output variance rather than genuine reasoning improvement. The proposed parallel thinking strategy achieves accuracy improvements of up to 22% under the same token budget.
DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning: DreamPRM is proposed to automatically learn domain weights for multimodal reasoning datasets via bi-level optimization, addressing the data quality imbalance in PRM training. It achieves 85.2% top-1 accuracy on the MathVista leaderboard using the o4-mini model.
Exact Expressive Power of Transformers with Padding: This paper provides an exact characterization of the expressive power of Transformers with padding: fixed depth combined with polynomial padding is precisely equivalent to $\mathsf{FO}$-uniform $\mathsf{TC}^0$; further combined with $O(\log^d n)$ looping, this is precisely equivalent to $\mathsf{FO}$-uniform $\mathsf{TC}^d$; and polylog looping converges to $\mathsf{NC}$. These results establish a complete theoretical foundation for padding and looping as parallel inference-time computation mechanisms.
ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning: This paper proposes Self-Explanation Policy Optimization (ExPO), a modular framework that addresses the fundamental challenge of distribution sharpening in RL post-training methods such as GRPO. When the model's initial success rate on hard reasoning tasks is near zero, effective positive samples are unavailable for learning. ExPO resolves this by prompting the model to generate reasoning chains (self-explanations) conditioned on the ground-truth answer. The resulting self-explanation samples are both in-distribution with respect to the current policy and provide positive learning signals. ExPO integrates seamlessly into both DPO and GRPO frameworks.
GPO: Learning from Critical Steps to Improve LLM Reasoning: GPO estimates the advantage function for each step in a reasoning trajectory via Monte Carlo simulation to identify "critical steps" (the turning points where the model makes errors), then resets from those critical steps and resamples new trajectories for training. This plug-and-play approach consistently improves multiple optimization algorithms—including PPO, DPO, KTO, SimPO, and ORPO—on reasoning tasks.
I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models: This paper introduces I-RAVEN-X, an enhanced symbolic reasoning benchmark that evaluates the generalization and robustness of analogical and mathematical reasoning in LLMs and LRMs by increasing operand complexity, attribute range, and perceptual uncertainty. Results show that LRMs significantly outperform LLMs under deterministic reasoning, but suffer sharp performance degradation under uncertain reasoning conditions.
Inference-Time Chain-of-Thought Pruning with Latent Informativeness Signals: This paper proposes KAPPA (KL-Adjusted Pruned Path Algorithm), which progressively prunes reasoning branches in Best-of-N sampling using three training-free signals — KL divergence, confidence, and entropy — achieving up to 60% peak memory reduction and 90% token generation reduction while maintaining accuracy.
Is CoT a Hallucination? A Data Distribution Perspective: By constructing DataAlchemy, a fully controlled abstract environment, this paper reveals that CoT reasoning is a form of hallucination: its effectiveness is entirely governed by the training data distribution, and it proves extremely fragile under out-of-distribution scenarios.
Know What You Don't Know: Uncertainty Calibration of Process Reward Models: This paper proposes a quantile regression-based calibration method for PRMs, enabling their output scores to more accurately reflect the actual success probability of LLM reasoning. Building on the calibrated PRM, the paper further introduces an Instance-Adaptive Scaling (IAS) strategy for inference-time computation, achieving significant cost reduction while maintaining accuracy.
Large Language Models Can Learn and Generalize Steganographic Chain-of-Thought under Process Supervision: This paper demonstrates that LLMs under RL training with CoT process supervision (penalizing specific strings) spontaneously learn steganography—concealing prohibited reasoning steps via substitute encodings. These encodings are causally load-bearing and generalize to strings never encountered during training.
Latent Chain-of-Thought for Visual Reasoning: This paper reformulates visual CoT reasoning as a posterior inference problem and proposes LaCoT, a training framework based on amortized variational inference (AVI) comprising reference-guided GFlowNet fine-tuning (RGFN), token-level reward approximation, and Bayesian inference scaling (BiN). On Qwen2.5-VL 3B/7B, LaCoT outperforms GRPO by 10.6% and achieves open-source state-of-the-art across seven visual reasoning benchmarks.
Let LRMs Break Free from Overthinking via Self-Braking Tuning: This paper proposes the Self-Braking Tuning (SBT) framework, which identifies overthinking patterns in reasoning traces and constructs adaptive-length training data to teach large reasoning models (LRMs) to autonomously determine when to stop reasoning. SBT reduces token consumption by 30%–60% on mathematical reasoning tasks while maintaining accuracy.
Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones: This paper demonstrates theoretically and empirically that there exist reasoning tasks (graph connectivity) for which a single long CoT (sequential scaling) is equivalent in capability to exponentially many short CoTs (parallel scaling)—i.e., reducing CoT length by even a small amount requires an exponential increase in parallel samples to achieve the same accuracy.
LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling: This paper proposes PIR (Perplexity-based Importance Refinement), a framework that categorizes reasoning chains distilled from LRMs into "progressive reasoning" and "functional steps" (verification / multi-method validation / error correction), and prunes only functional steps with low PIR scores while preserving the progressive reasoning backbone intact. Fine-tuning on the refined data improves accuracy by 0.9%–6.6% on AIME/AMC/GPQA while reducing token usage by 3%–41%, yielding up to 71% efficiency gain.
Lost in Transmission: When and Why LLMs Fail to Reason Globally: This paper proposes the Bounded Attention Prefix Oracle (BAPO) computational framework, which models LLM attention heads as finite-bandwidth communication channels. It proves that global reasoning problems such as graph reachability are BAPO-hard (requiring super-constant bandwidth), and shows that Chain-of-Thought (CoT) can transform any BAPO-hard problem into a BAPO-easy one. Theoretical predictions are validated experimentally on GPT-4o, Claude, and Gemini.
Many LLMs Are More Utilitarian Than One: A controlled study across six LLMs identifies a "Utilitarian Boost" phenomenon: LLMs engaged in dyadic or triadic moral deliberation are more likely than their solo counterparts to endorse harming a minority for the benefit of the majority. This effect is especially pronounced in personal dilemmas involving direct harm ($\beta=0.31, p<.0001$), and the underlying mechanisms differ across models—some exhibit reduced norm sensitivity, others heightened impartiality.
Mapping Faithful Reasoning in Language Models: This paper proposes the Concept Walk framework, which tracks the evolution of internal concept representations across reasoning steps by projecting residual stream activations at each step onto concept directions learned from contrastive data, thereby distinguishing whether a CoT chain genuinely participates in computation or merely serves as post-hoc decorative output.
Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning: This paper provides the first systematic formalization of the "Thought Leap" phenomenon in CoT reasoning chains, and proposes CoT-Bridge, a model that automatically detects and fills omitted intermediate steps. It achieves up to +5.87% improvement on NuminaMath and can serve as a plug-and-play module to enhance distillation and RL pipelines.
On Learning Verifiers and Implications to Chain-of-Thought Reasoning: This paper proposes a formal PAC learning framework for Chain-of-Thought verifiers, defining three progressively stronger verification objectives (Simple → Trustable → γ-Trustable). It proves that when each problem admits only a bounded number of correct proofs, the sample complexity is $O(\log|H|)$; however, when the number of correct proofs is unbounded, the sample complexity inevitably grows to $\Theta(|H|)$, unless the verifier class satisfies additional structural assumptions such as intersection-closure. The paper also exploits the USAT problem to demonstrate a computational complexity gap between verification and generation.
One Token Embedding Is Enough to Deadlock Your Large Reasoning Model: This paper proposes the Deadlock Attack, which optimizes a single adversarial token embedding and implants it into a Large Reasoning Model (LRM) via a backdoor mechanism, causing the model to enter a permanent reasoning loop during inference (endlessly generating transition words such as "Wait" and "But"). The attack achieves a 100% attack success rate across 4 LRMs and 3 mathematical reasoning benchmarks, with negligible performance degradation on clean inputs.
ProofSketch: Efficient Verified Reasoning for Large Language Models: ProofSketch is a framework that combines symbolic closure-based forward reasoning, compact sketch generation, and formal verification in a multi-stage pipeline, achieving formal correctness guarantees for logical reasoning while reducing token consumption.
Provable Scaling Laws for the Test-Time Compute of Large Language Models: This paper proposes two two-stage test-time compute algorithms — Knockout (pairwise elimination in a tournament bracket) and League (ranking by average win rate) — and proves under minimal assumptions that the failure probability decays exponentially or as a power law to zero as test-time compute increases. The assumptions required are merely that (1) the LLM generates a correct solution with nonzero probability, and (2) the LLM's pairwise comparisons are better than random. The entire pipeline requires only a black-box LLM, with no external verifier or reward model.
Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning: This paper proposes Re-FORC, a lightweight adapter that predicts the future expected reward $\psi(t|x,z,\pi)$ in real time during CoT reasoning. The framework models reasoning compute allocation as a Pandora's box problem, enabling adaptive early stopping (26% compute savings), joint model-and-compute selection (+4% accuracy at equal compute, or −55% compute at equal accuracy), and test-time compute scaling (+11% accuracy). Users can freely adjust the accuracy–efficiency trade-off at inference time via a cost coefficient $\lambda$, without any retraining.
RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics: This paper introduces RealMath, a continuously refreshable benchmark that automatically extracts verifiable mathematics problems from arXiv papers and Math StackExchange, designed to evaluate LLMs on real-world research-level mathematical tasks.
ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs: ReasonFlux-PRM identifies that existing PRMs fail to effectively evaluate the intermediate thinking trajectories of reasoning models, and proposes a trajectory-aware PRM that fuses step-level alignment, quality, and coherence scores with a trajectory-level template-guided reward. The approach consistently outperforms strong baselines including Qwen2.5-Math-PRM-72B across three settings: offline data selection (SFT +12.1%), online RL reward (+4.5%), and test-time Best-of-N scaling (+6.3%).
Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought: This paper theoretically demonstrates the expressive advantage of continuous chain-of-thought (Coconut) on directed graph reachability: a two-layer Transformer using $D$ continuous thought steps suffices to solve graph reachability with diameter $D$, whereas discrete CoT requires $O(n^2)$ steps. The core mechanism is that continuous thought vectors encode multiple search frontiers simultaneously in a "superposition state," enabling implicit parallel BFS.
Reasoning Models Better Express Their Confidence: This paper systematically demonstrates that reasoning models (with extended CoT) exhibit significantly better confidence calibration than non-reasoning models, and identifies "slow-thinking" behaviors—exploring alternatives, backtracking, and verification—as the fundamental source of this calibration improvement.
Reasoning Models Hallucinate More: Factuality-Aware Reinforcement Learning for Large Reasoning Models: This paper reveals that RL-trained reasoning models (e.g., DeepSeek-R1) hallucinate significantly more than non-reasoning models, theoretically identifies three root causes (high-variance gradients, entropy constraints, and spurious local optima), and proposes the FSPO algorithm, which adjusts token-level advantages via step-level factuality verification to reduce hallucination while maintaining or even improving reasoning capability.
Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling: This paper proposes Variable Granularity Search (VG-Search), which unifies Beam Search and Best-of-N under a tunable verification granularity parameter $g$. It demonstrates that conventional per-step verification is suboptimal, and that adaptively adjusting $g$ can improve accuracy by 3%+ while reducing computation by 52%+.
SafePath: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment: SafePath proposes fine-tuning only an 8-token "Safety Primer" ("Let's think about safety first") at the very beginning of the reasoning chain, effectively steering Large Reasoning Models (LRMs) toward safe reasoning paths. On DeepSeek-R1-Distill, it reduces harmful outputs by 90% while requiring only 1/296 of the training compute of Direct Refusal.
Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding: This paper proposes Self-Truncation Best-of-N (ST-BoN), a decoding method that leverages a theoretical guarantee showing early hidden-state consistency predicts final consistency, enabling identification and truncation of suboptimal samples at early decoding steps. ST-BoN reduces memory usage by over 80% and latency by ~50% while preserving standard BoN performance.
Scalable Best-of-N Selection for Large Language Models via Self-Certainty: This paper proposes Self-Certainty, a metric that quantifies model confidence via the token probability distribution of LLM outputs, enabling scalable Best-of-N selection without any auxiliary reward model. The approach achieves performance comparable to or exceeding reward-model-based methods.
Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models: This paper proposes the SPO framework, which adopts segment-level (rather than token-level or trajectory-level) advantage estimation. Through a novel Monte Carlo method and tree-based sampling, SPO outperforms PPO and GRPO by 6–12 and 7–11 percentage points in short-CoT and long-CoT settings, respectively.
PolyMath — Evaluating Mathematical Reasoning in a Multilingual Context: PolyMath introduces a mathematical reasoning benchmark spanning 18 languages, 4 difficulty levels, and 500 problems, revealing that: (1) reasoning performance varies by up to 10 points across languages; (2) reasoning models exhibit low input–output language consistency, which may affect performance; and (3) thinking length varies substantially across languages. These findings offer new perspectives for multilingual reasoning research.
Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards: The final layer of Phi-4 family small models (3.8B/14B) is replaced with a regression head and fine-tuned, enabling them to serve simultaneously as ORM (outcome reward model) and PRM (process reward model). On code generation tasks, selecting the optimal rollout yields 20%+ improvements in pass@k.
SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models: By restructuring long chain-of-thought reasoning traces into interleaved planning and parallel execution stages, SPRINT reduces sequential token counts by up to 39% on in-distribution tasks (up to 65% on OOD tasks) while maintaining accuracy, enabling dynamic parallelization of the reasoning process.
SQL-of-Thought: Multi-agentic Text-to-SQL with Guided Error Correction: This paper proposes SQL-of-Thought, a multi-agent Text-to-SQL framework that decomposes the task into schema linking → subproblem identification → CoT query plan generation → SQL generation → guided correction loop based on a 31-category error taxonomy. Using Claude 3 Opus on the Spider benchmark, it achieves 91.59% execution accuracy, outperforming the previous best Chase SQL (87.6%) by nearly 4 percentage points.
SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning: This work presents the first systematic application of GRPO-based reinforcement learning to NL2SQL tasks. Through a four-level progressive reward function and a training strategy combining 200K cold-start data with 5K complex-sample RL fine-tuning, the 7B model achieves 88.7% on Spider and 66.6% on BIRD, surpassing GPT-4-based methods at comparable scale.
Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning: PURE identifies the root cause of reward hacking induced by PRMs as the standard sum-form credit assignment in RL ($V(s) = \sum \gamma^t r_t$), and proposes a min-form alternative ($V(s) = \min_{t' \geq t} r_{t'}$). By constraining the value function to the minimum of future rewards rather than their cumulative sum, PURE significantly mitigates reward hacking—achieving reasoning performance comparable to rule-based reward methods using only 30% of the training steps.
The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness: This work presents the first systematic quantification of "test awareness" (the Hawthorne effect) in reasoning-oriented LLMs: models alter their behavior upon detecting that they are being evaluated. The paper localizes awareness-related activations via linear probes and applies parameter editing for steering, revealing that test awareness exerts a significant yet directionally inconsistent influence on safety alignment.
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity: Using controlled puzzle environments, this paper systematically reveals a three-regime behavioral pattern in Large Reasoning Models (LRMs): performance falls below standard LLMs at low complexity (overthinking), substantially surpasses them at moderate complexity, and collapses completely (0%) at high complexity. Counterintuitively, models reduce thinking token usage at the point of collapse, demonstrating that current LRMs have not developed genuinely generalizable reasoning capabilities.
The Impact of Quantization on Large Reasoning Model Reinforcement Learning: This paper presents a systematic empirical study showing that quantization-aware fine-tuning (QAFT/STE) during RL training of large reasoning models (LRMs) degrades reasoning capability, whereas post-training quantization (PTQ) and QLoRA preserve reasoning performance well even at 4-bit precision. The authors recommend a practical pipeline of full-precision RL training followed by PTQ quantization.
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning: This paper decomposes Reinforcement Learning from Verifiable Rewards (RLVR) into Positive Sample Reinforcement (PSR, which increases the probability of correct responses) and Negative Sample Reinforcement (NSR, which penalizes incorrect responses). The authors find that NSR alone consistently improves reasoning performance across the entire Pass@k spectrum and typically matches or surpasses PPO/GRPO. Based on this finding, the paper proposes Weighted-REINFORCE (reducing the PSR weight to 0.1), which achieves state-of-the-art results across MATH, AIME 2025, and AMC23.
The Virtues of Brevity: Avoid Overthinking in Parallel Test-Time Reasoning: This paper demonstrates that selecting the shortest solution in Best-of-N sampling for reasoning models is a simple yet counterintuitive and effective heuristic, achieving performance comparable to self-consistency at significantly lower token cost. The underlying mechanism exploits a systematic bias in reasoning models between a "conventional mode" and an "overthinking mode."
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing: ThinkSound is a three-stage interactive video-to-audio framework that leverages an MLLM to generate structured CoT reasoning as guidance for a unified audio foundation model. It achieves state-of-the-art performance on VGGSound and MovieGen Audio benchmarks while supporting object-level refinement and natural language instruction-based editing.
TimE: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios: This paper introduces TimE, a multi-level temporal reasoning benchmark comprising 38,522 QA pairs across three real-world scenarios — knowledge-intensive (Wiki), dynamic news (News), and long dialogue (Dial) — and three progressively difficult levels with 11 sub-tasks. A comprehensive evaluation of 24 LLMs reveals that even the strongest reasoning models exhibit significant deficiencies on complex tasks such as timeline construction and counterfactual reasoning.
TimE: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios: This paper introduces TimE, a multi-level temporal reasoning benchmark comprising 38,522 QA pairs across three real-world scenarios — knowledge-intensive (Wiki), dynamic events (News), and multi-turn dialogue (Dial) — with 11 fine-grained subtasks for systematic evaluation of LLMs' temporal reasoning capabilities. A manually annotated subset, TimE-Lite, is also released.
Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties: This paper introduces the concept of a reasoning graph — a directed graph constructed by clustering the hidden states of LLMs — and analyzes large reasoning models (e.g., the DeepSeek-R1 distillation series) along three graph-theoretic dimensions: cycle density, diameter, and small-world index. Reasoning models are found to exhibit significantly more cycles (~5 per sample), larger diameters, and stronger small-world properties (~6×), all of which grow with task difficulty and model scale.
Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning: This paper demonstrates that excessively extending CoT length degrades LLM reasoning performance, and proposes Thinking-Optimal Scaling (TOPS), a strategy that trains models to select the shortest correct response for each problem via self-improvement, outperforming existing distillation methods in both accuracy and efficiency.
Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization: This paper provides the first optimization-theoretic proof that a one-layer Transformer trained via gradient descent can learn CoT reasoning on a synthetic state-tracking task and achieve length generalization. It is the first work to establish convergence guarantees for constant-depth Transformers learning $\mathsf{NC}^1$-complete problems, going beyond prior theory that was limited to $\mathsf{TC}^0$.
TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation: This paper proposes TTS-VAR — the first test-time scaling framework specifically designed for Visual Auto-Regressive (VAR) models. It formulates image generation as a path searching problem and achieves an 8.7% improvement on GenEval (0.69 → 0.75) with Infinity 2B by combining adaptive descending batch sizes, early-stage clustering-based diversity search, and late-stage resampling-based potential selection. With $N=2$, TTS-VAR already surpasses Best-of-N at $N=8$.
Two-Stage Learning of Stabilizing Neural Controllers via Zubov Sampling and Iterative Domain Expansion: A two-stage training framework is proposed: the first stage estimates the region of attraction (ROA) via Zubov-guided sampling and dynamic domain expansion, while the second stage refines the result through CEGIS-based counterexample-driven training. The framework jointly learns a neural network controller and a Lyapunov function, achieving ROA volumes 5 to $1.5 \times 10^5$ times larger than baselines and verification speeds 40–10000× faster than dReal.
Unlabeled Data Can Provably Enhance In-Context Learning of Transformers: This paper proposes an augmented ICL framework in which the prompt contains both a small set of labeled examples and a large collection of unlabeled examples. It theoretically proves that a multi-layer Transformer, via chain-of-thought (CoT) reasoning, can simulate the EM algorithm to extract information from unlabeled data, improving the classification excess risk from $\mathcal{O}(1/\sqrt{N})$ to $\mathcal{O}(1/\sqrt{N + \text{poly}(M)})$.
Unlocking Multimodal Mathematical Reasoning via Process Reward Model: This paper proposes URSA, a three-stage framework that sequentially constructs a million-scale multimodal CoT dataset (MMathCoT-1M) for base model training, a dual-perspective process supervision dataset (DualMath-1.1M) for PRM training, and a PS-GRPO algorithm that integrates the PRM into online RL. The resulting 8B model surpasses GPT-4o by an average of 2.7% across six mathematical benchmarks.
Self-Evaluating LLMs - Step-Level Confidence Estimation for Multi-Step Tasks: This paper extends confidence estimation to multi-step tasks, demonstrating that step-level evaluation detects reasoning failures more effectively than response-level evaluation, achieving a 15% relative AUC-ROC improvement over holistic evaluation on CoQA, and providing a practical framework for trustworthy deployment of multi-step reasoning systems.
Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought: This paper proposes "Visual Thoughts" as a unified framework for interpreting the effectiveness of multimodal chain-of-thought reasoning (MCoT). The core mechanism underlying performance gains in both textual MCoT (T-MCoT) and interleaved multimodal MCoT (I-MCoT) is the caching and transfer of visual information into the reasoning process. The paper defines four forms of visual thought expressions and reveals their role as image-to-reasoning intermediaries in deep Transformer layers.

LLM Safety¶

A Cramér–von Mises Approach to Incentivizing Truthful Data Sharing: This paper proposes an incentive mechanism based on the Cramér–von Mises (CvM) two-sample test statistic. Under both Bayesian and prior-free settings, the mechanism provably makes truthful data submission an (approximate) Nash equilibrium while encouraging participants to contribute more genuine data, without relying on strong distributional assumptions (e.g., Gaussian or Bernoulli).
A Reliable Cryptographic Framework for Empirical Machine Unlearning Evaluation: This paper models the machine unlearning evaluation problem as a cryptographic game (the unlearning sample inference game), quantifies unlearning quality via the adversary's "advantage," and addresses multiple shortcomings of traditional MIA accuracy as an evaluation metric—namely, the lack of a retrain-as-zero baseline, sensitivity to data partitioning, and sensitivity to the choice of MIA. A SWAP test is further proposed as an efficient practical approximation.
Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning: This paper proposes FedLEASE, which addresses two critical challenges in federated LoRA fine-tuning: (1) automatically determining the optimal number of experts and their assignment via LoRA B-matrix similarity clustering, and (2) enabling adaptive top-M expert selection through an expanded routing space of $2M-1$ dimensions, allowing each client to determine how many experts to use. FedLEASE achieves an average improvement of 5.53% over the strongest baseline on GLUE.
Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text: This paper proposes Adversarial Paraphrasing — a training-free universal attack framework that selects the most "human-like" token at each decoding step by leveraging feedback signals from AI text detectors during token-by-token paraphrasing. The approach achieves an average T@1%F reduction of 87.88% across 8 detectors and exhibits strong cross-detector transferability.
AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text: This paper proposes the AgentStealth framework, which trains a small language model (SLM) through a three-stage pipeline comprising an adversarial anonymization workflow, supervised fine-tuning (SFT), and online reinforcement learning, achieving effective anonymization of user-generated content while preserving text utility — yielding a 12.3% improvement in anonymization performance and 6.8% improvement in utility.
ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models: The first defense framework against jailbreak attacks on audio-language models (ALMs). The work discovers that aligned ALMs possess latent safety shortcuts that can be activated, and proposes a Mel Gradient Sparse Mask (M-GSM) to identify critical frequency bins. By applying Shortcut Activation Perturbations (SAP) to these bins, the average attack success rate is reduced from 41.6% to 4.6% with negligible degradation of normal task performance.
Angular Steering: Behavior Control via Rotation in Activation Space: This paper proposes Angular Steering, which unifies LLM activation steering as rotation operations within a fixed 2D subspace. By parameterizing behavior control through rotation angle, it provides a continuous, fine-grained, norm-preserving knob spanning 0°–360°, while unifying activation addition and directional ablation as special cases of rotation. The approach achieves robust behavior control on Llama 3, Qwen 2.5, and Gemma 2 (3B–14B).
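To make the "rotation within a fixed 2D subspace" idea from the Angular Steering entry concrete, here is a minimal sketch. It assumes the steering plane is given by two orthonormal vectors `b1` and `b2` (hypothetical names) and rotates only the in-plane component of an activation by a relative angle, leaving the orthogonal remainder untouched. How the paper picks the plane, and whether it rotates by or to an angle, is not stated in this summary, so treat those details as assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def rotate_in_plane(h, b1, b2, theta_deg):
    """Rotate the component of h inside span{b1, b2} by theta_deg degrees.

    Assumes b1 and b2 are orthonormal. The part of h orthogonal to the
    plane is left unchanged, so the overall norm of h is preserved.
    """
    theta = math.radians(theta_deg)
    a, b = dot(h, b1), dot(h, b2)          # in-plane coordinates of h
    a_new = a * math.cos(theta) - b * math.sin(theta)
    b_new = a * math.sin(theta) + b * math.cos(theta)
    # Swap the old in-plane component for the rotated one.
    return [x + (a_new - a) * p + (b_new - b) * q
            for x, p, q in zip(h, b1, b2)]

# Toy activation and plane (illustrative values only).
h = [3.0, 4.0, 12.0]
b1 = [1.0, 0.0, 0.0]
b2 = [0.0, 1.0, 0.0]
steered = rotate_in_plane(h, b1, b2, 90.0)
print(steered, norm(h), norm(steered))
```

Because only a 2D rotation is applied, the angle acts as a single continuous knob, and a full 360° turn returns the original activation up to floating-point error.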
Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks Against LLMs: This paper models adversarial attacks on LLMs as an information channel problem, defining the "bits leaked per query" $I(Z;T)$ as the mutual information between the attack target attribute $T$ and the observable signal $Z$, and proving that the minimum number of queries required to achieve error $\varepsilon$ is $\log(1/\varepsilon)/I(Z;T)$. Validated across 7 LLMs: exposing only answer tokens requires ~1000 queries; adding logits reduces this to ~100; adding chain-of-thought (CoT) further reduces it to roughly tens of queries. This provides the first principled metric for the transparency–security trade-off.
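The query bound $\log(1/\varepsilon)/I(Z;T)$ in the entry above can be evaluated directly. The sketch below assumes $I(Z;T)$ is measured in bits and therefore uses a base-2 logarithm; the summary does not state the base. The leakage rates in the example are hypothetical values chosen for illustration, not measurements from the paper.

```python
import math

def min_queries(epsilon, bits_per_query):
    """Evaluate log2(1/epsilon) / I(Z;T), with I(Z;T) given in bits per query."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie in (0, 1)")
    if bits_per_query <= 0.0:
        raise ValueError("bits_per_query must be positive")
    return math.log2(1.0 / epsilon) / bits_per_query

# Hypothetical leakage rates: each 10x increase in leaked bits per query
# cuts the required number of queries by 10x.
for label, leak in [("low leakage", 0.01), ("medium leakage", 0.1), ("high leakage", 1.0)]:
    print(label, round(min_queries(1e-3, leak), 1))
```

The inverse dependence on $I(Z;T)$ is what drives the order-of-magnitude gaps between exposure settings reported in the entry.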
Buffer Layers for Test-Time Adaptation: This paper proposes Buffer layers as a new paradigm for Test-Time Adaptation (TTA), replacing conventional normalization layer updates to fundamentally preserve the integrity of the pretrained backbone. The approach effectively alleviates catastrophic forgetting and achieves consistent performance improvements across diverse architectures and TTA frameworks.
Collective Narrative Grounding: Community-Coordinated Data Contributions to Improve Local AI Systems: This paper proposes the Collective Narrative Grounding protocol, which collects community narratives through participatory workshops and structures them into "narrative units." A RAG pipeline then injects this local knowledge into LLM-based QA systems. Experiments on LocalBench reveal that 76.7% of errors can be directly remediated by local narratives, and GPT-5 achieves only 21% accuracy on the participatory QA set, highlighting the severity of the local knowledge gap.
Contextual Integrity in LLMs via Reasoning and Reinforcement Learning: This paper proposes CI-RL, a framework that combines Chain-of-Thought reasoning prompts with GRPO reinforcement learning to train LLMs to understand contextual integrity (CI) using only ~700 synthetic samples. On the PrivacyLens benchmark, it reduces privacy leakage rates by up to 40%, and smaller models trained with CI-RL can surpass larger baseline models.
CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment: This paper proposes CoreGuard, which locks Transformer linear layer weights via row permutation and reduces TEE authorization to a single invocation through a column-permutation propagation protocol, protecting foundational capabilities of edge-deployed LLMs against model stealing attacks with negligible computational and communication overhead.
CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming: To address the prevalence of duplicate and near-duplicate problems in competitive programming—which compromises contest fairness and inflates LLM evaluation scores—this work constructs CPRet, a large-scale benchmark spanning four retrieval tasks, and proposes CPRetriever, a domain-specific retrieval model trained with Group-InfoNCE loss. CPRetriever surpasses 20+ existing embedding models across all tasks and reveals systematic evaluation bias in LiveCodeBench attributable to problem similarity.
CryptoMoE: Privacy-Preserving and Scalable Mixture of Experts Inference via Balanced Expert Routing: CryptoMoE is the first framework supporting privacy-preserving inference for MoE-based LLMs. By combining balanced expert routing to conceal routing information, a confidence-aware dispatch protocol, and a batch ciphertext matrix multiplication protocol, it achieves 2.8–3.5× latency reduction and 2.9–4.3× communication reduction compared to a dense baseline, with only 0.8% accuracy loss.
DeepPersona: A Generative Engine for Scaling Deep Synthetic Personas: This paper presents DeepPersona, a two-stage taxonomy-guided synthetic persona generation engine. Stage 1 mines a human attribute taxonomy with 8,000+ nodes from real user–ChatGPT conversations; Stage 2 generates narratively coherent personas averaging 200+ structured attributes via progressive attribute sampling. The approach achieves an 11.6% improvement in personalized QA accuracy and a 31.7% reduction in social survey simulation bias.
Demystifying Language Model Forgetting with Low-Rank Example Associations: This paper discovers that the association matrix between upstream sample forgetting and newly learned tasks exhibits a low-rank structure (rank-3 achieves $R^2 > 0.69$) after LLM fine-tuning, and leverages matrix completion to predict forgetting induced by unseen tasks, thereby guiding selective replay to mitigate forgetting.
Differentially Private Federated Low Rank Adaptation Beyond Fixed-Matrix: This paper proposes FedASK, a framework that employs a two-stage sketching pipeline (inspired by randomized SVD) to, for the first time under differential privacy, enable simultaneous effective updates of both low-rank matrices A and B in federated LoRA, achieving up to 11.5% improvement on MMLU and 46% on GSM8K over baselines on Llama-2 7B/13B.
Distributive Fairness in Large Language Models: Evaluating Alignment with Human Values: This paper systematically evaluates the distributive fairness preferences of several SOTA LLMs (GPT-4o, Claude-3.5S, Llama3-70b, Gemini-1.5P) on non-strategic resource allocation tasks. The results reveal significant divergence between LLMs and humans: LLMs favor efficiency and envy-freeness (EF) while neglecting equality (EQ), which humans prioritize. However, in multiple-choice settings, GPT-4o and Claude can correctly identify the fairest allocation.
DNA-DetectLLM: Unveiling AI-Generated Text via a DNA-Inspired Mutation-Repair Paradigm: This paper proposes DNA-DetectLLM, a zero-shot AI-generated text detection method inspired by the DNA mutation-repair mechanism. It constructs an ideal AI sequence and quantifies the cumulative difficulty of repairing the input text toward that sequence as the detection signal, achieving state-of-the-art results with a relative AUROC improvement of 5.55% and F1 improvement of 2.08% across multiple benchmark datasets.
Enhancing CLIP Robustness via Cross-Modality Alignment: This paper proposes COLA, a training-free framework that eliminates non-semantic noise by projecting adversarially perturbed image features onto the subspace spanned by text features, and then employs optimal transport (OT) to perform fine-grained distribution-level image-text alignment. COLA achieves an average improvement of 6.7% in adversarial robust accuracy across 14 zero-shot classification benchmarks while preserving clean sample performance.
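The projection step described in the COLA entry can be illustrated as follows. This is a sketch under stated assumptions, not the paper's implementation: it builds an orthonormal basis for the span of the text features with Gram-Schmidt and keeps only the part of an image feature that lies in that span. The helper names are hypothetical, and the optimal-transport alignment stage is not covered.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        n = math.sqrt(dot(w, w))
        if n > tol:                      # skip linearly dependent vectors
            basis.append([wi / n for wi in w])
    return basis

def project_onto_span(x, vectors):
    """Orthogonal projection of x onto span(vectors)."""
    proj = [0.0] * len(x)
    for q in orthonormal_basis(vectors):
        c = dot(x, q)
        proj = [p + c * qi for p, qi in zip(proj, q)]
    return proj

# Toy example: two text features span the first two coordinates, so the
# third coordinate of the image feature (standing in for noise) is removed.
text_features = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
image_feature = [2.0, 3.0, 5.0]
print(project_onto_span(image_feature, text_features))
```

In practice the features would be high-dimensional CLIP embeddings and a batched linear-algebra routine would replace the pure-Python loops.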
Enhancing Sample Selection Against Label Noise by Cutting Mislabeled Easy Examples: This paper identifies and defines Mislabeled Easy Examples (MEEs)—samples whose incorrect labels are confidently learned by the model in the early stages of training—and demonstrates that these samples cause the greatest harm to generalization. An Early Cutting method is proposed to filter MEEs by recalibrating the early-stage confident subset using the model's later-stage state.
Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions: This paper systematically evaluates the hiring-match performance of mainstream LLMs—including GPT-4o/4.1, Claude 3.5, Gemini 2.5, Llama 3.1/4, and DeepSeek R1—on approximately 10,000 real-world candidate–job pairs. Results show that a domain-specialized model (Match Score) comprehensively outperforms general-purpose LLMs in both accuracy (AUC 0.85 vs. 0.77) and fairness (Race IR 0.957 vs. ≤0.809).
Exploring the Limits of Strong Membership Inference Attacks on Large Language Models: This work presents the first extension of strong membership inference attacks (LiRA) to GPT-2-scale LLMs ranging from 10M to 1B parameters, training over 4,000 reference models. Among the four key findings it reports: strong MIAs can succeed on LLMs but with limited effectiveness (AUC < 0.7), and a substantial fraction of per-sample decisions are indistinguishable from random coin flips under training randomness.
FedRW: Efficient Privacy-Preserving Data Reweighting for Enhancing Federated Learning of Language Models: FedRW proposes the first privacy-preserving soft deduplication framework for federated learning that requires no trusted third party. By leveraging secure multi-party computation to obtain global sample frequencies and performing frequency-aware sample reweighting, it achieves up to 28.78× preprocessing speedup and approximately 11.42% improvement in perplexity over prior methods.
FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA: FedSVD proposes globally reparameterizing LoRA matrices via SVD, updating the $A$ matrix each communication round using the right singular vectors of the aggregated $BA$ product. This approach avoids the quadratic noise amplification under DP-SGD while preserving the adaptive capacity of $A$, consistently outperforming fixed-$A$ baselines across multiple NLU benchmarks.
Finding Structure in Continual Learning: This paper proposes a continual learning optimization framework based on Douglas-Rachford Splitting (DRS), which decouples stability and plasticity into two independent proximal subproblems, and replaces KL divergence with Rényi divergence for more robust prior alignment, thereby effectively alleviating catastrophic forgetting without replay buffers or additional modules.
Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation: Geo-Sign projects skeleton features into a Poincaré ball model of hyperbolic space and regularizes an mT5 language model via a hyperbolic contrastive loss, enabling the model to perceive the hierarchical structure of sign language motion. Using only skeleton data, the method surpasses RGB-based SOTA on CSL-Daily (BLEU-4 +1.81, ROUGE-L +3.03).
HealthSLM-Bench: Benchmarking Small Language Models for Mobile and Wearable Healthcare Monitoring: The first benchmark systematically evaluating small language models (SLMs, 1–4B parameters) on mobile and wearable health monitoring tasks, covering zero-shot, few-shot, and instruction fine-tuning paradigms, with on-device deployment validated on an iPhone.
InvisibleInk: High-Utility and Low-Cost Text Generation with Differential Privacy: This paper proposes InvisibleInk, a framework that reduces the computational cost of differentially private long-text generation by more than 8× through two innovations—differential clipping (DClip) for isolating sensitive information and Top-$k^+$ truncated sampling—achieving, for the first time, high-quality private text generation with only 4–8× overhead over non-private generation.
Learning to Watermark: A Selective Watermarking Framework for Large Language Models via Multi-Objective Optimization: This paper proposes LTW (Learning to Watermark), a framework that employs a lightweight selector network to adaptively determine when to apply watermarks based on sentence embeddings, token entropy, and the current watermarking ratio. By leveraging multi-objective optimization via MGDA, LTW achieves a Pareto-optimal balance between detectability and text quality, substantially improving watermarked text quality without compromising detection performance.
LLM Strategic Reasoning: Agentic Study through Behavioral Game Theory: This paper proposes an LLM strategic reasoning evaluation framework grounded in behavioral game theory. It employs Truncated Quantal Response Equilibrium (TQRE) to quantify reasoning depth τ, evaluates 22 state-of-the-art models across 13 matrix games, and reveals differences in reasoning styles as well as biases induced by demographic personas.
MaskSQL: Safeguarding Privacy for LLM-Based Text-to-SQL via Abstraction: This paper proposes MaskSQL, a framework that protects privacy by replacing sensitive table names, column names, and data values with abstract symbols before sending prompts to a remote LLM. Combined with a local SLM for schema linking and SQL reconstruction, MaskSQL preserves privacy while surpassing SLM-only approaches in SQL generation accuracy.
MixAT: Combining Continuous and Discrete Adversarial Training for LLMs: This paper proposes MixAT, a method that combines discrete adversarial attacks (PAP-based rewriting) with continuous embedding-space perturbations for LLM adversarial training. MixAT achieves robustness against diverse attacks (reducing ALO-ASR from 50%+ to below 20%) while preserving utility, at a training cost comparable to purely continuous methods.
MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference: This paper proposes MPCache, a KV cache eviction framework designed for secure multi-party computation (MPC), combining one-time static eviction with query-aware dynamic selection. Through hierarchical clustering, linearized similarity approximation, and cross-layer index sharing, MPCache achieves up to 2.01× latency reduction and 8.37× communication volume reduction without sacrificing LLM performance.
Music Arena: Live Evaluation for Text-to-Music: Music Arena is the first online live evaluation platform for text-to-music (TTM) generation. It addresses the heterogeneous signature problem of TTM systems via an LLM-driven moderation and routing system, collects multi-level preference data including fine-grained listening behavior and natural language feedback, and provides the community with a sustainable open preference data source through monthly rolling data releases.
On the Empirical Power of Goodness-of-Fit Tests in Watermark Detection: This paper systematically evaluates eight classical goodness-of-fit (GoF) tests for LLM text watermark detection, demonstrating that GoF tests significantly outperform existing baseline methods in both detection power and robustness.
On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks: This paper presents the first systematic study on the robustness of LLM verbal confidence under adversarial attacks. It proposes a Verbal Confidence Attack (VCA) framework comprising perturbation-based and jailbreak-based attacks, demonstrating that such attacks can reduce confidence scores by up to 30% and cause answer-flip rates of up to 100%, and that existing defense strategies are largely ineffective.
On the Sample Complexity of Differentially Private Policy Optimization: This paper presents the first systematic study of sample complexity for policy optimization (PO) under differential privacy (DP) constraints. It proposes a unified meta-algorithm framework and analyzes three private policy optimization algorithms—DP-PG, DP-NPG, and DP-REBEL—proving that the privacy cost typically appears only as a lower-order term in the sample complexity.
ORBIT -- Open Recommendation Benchmark for Reproducible Research with Hidden Tests: This paper proposes ORBIT, a unified benchmark for recommender systems comprising standardized evaluation on 5 public datasets and a privacy-safe hidden test set, ClueWeb-Reco, constructed from real users' browsing histories. The benchmark systematically evaluates 12 recommendation models and introduces the LLM-QueryGen baseline, revealing the limitations of existing approaches in large-scale, real-world recommendation scenarios.
Poly-Guard: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset: This paper introduces Poly-Guard, the first large-scale, multi-domain, policy-grounded safety guardrail benchmark. It extracts 400+ risk categories and 1,000+ safety rules from 150+ real-world industry safety policies, generates 100K+ instances spanning 8 safety-critical domains, and systematically evaluates 19 guardrail models, revealing 8 key findings including domain specialization, evolutionary forgetting, scaling stagnation, and adversarial vulnerability.
Probabilistic Reasoning with LLMs for K-Anonymity Estimation: This paper proposes Branch, a framework that leverages large language models to model personal information disclosed in user-generated text as a joint probability distribution over a Bayesian network. By estimating conditional probabilities for individual attributes and composing them to compute k-anonymity values (i.e., the number of individuals globally matching a given profile), Branch achieves 73% accuracy on privacy risk estimation, outperforming o3-mini chain-of-thought reasoning by 13%.
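The composition step in the Branch entry, multiplying per-attribute conditional probabilities and scaling by a population size, can be sketched as below. The probabilities and population figure are hypothetical placeholders; in the framework described above they would come from LLM estimates over a Bayesian network, which this sketch does not model.

```python
import math

def joint_probability(conditionals):
    """Chain-rule product of P(attribute_i | parents) along the network."""
    for p in conditionals:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
    return math.prod(conditionals)

def k_anonymity(population, conditionals):
    """Expected number of people in the population matching the full profile."""
    return population * joint_probability(conditionals)

# Hypothetical profile: P(city), P(occupation | city), P(age band | city, occupation).
profile = [0.001, 0.01, 0.08]
k = k_anonymity(8_000_000_000, profile)
print(f"estimated k = {k:.0f}")
```

A small k means few people match the disclosed attributes, so the text carries a high re-identification risk.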
Procurement Auctions with Predictions: Improved Frugality for Facility Location: This paper studies procurement auction design for the strategic uncapacitated facility location problem. It proves that the frugality ratio of the classical VCG auction is exactly 3 (improving the previously known upper bound of 4), and designs learning-augmented auction mechanisms that exploit prediction information to achieve near-optimal frugality when predictions are accurate, while maintaining constant-factor robustness when predictions are arbitrarily inaccurate.
PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning: This paper proposes the PULSE evaluation protocol, which assesses existing unlearning methods for large multimodal models (LMMs) along two practically motivated dimensions: the forgetting of pretrained knowledge and the sustainability of repeated sequential unlearning. The findings reveal severe deficiencies in current methods: forgetting pretrained knowledge causes over 90% loss of general capability, and after five sequential unlearning operations, model generalization almost entirely collapses.
Reinforcement Learning with Backtracking Feedback: This paper proposes RLBF, a reinforcement learning framework with backtracking feedback that allows agents to return to previous states and re-explore when encountering dead ends. By leveraging backtracking signals to improve credit assignment, RLBF significantly enhances exploration efficiency in sparse-reward environments.
ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search: ReliabilityRAG proposes a RAG framework that leverages document reliability signals (e.g., search ranking) for adversarial defense. It identifies a consistent document subset by finding the Maximum Independent Set (MIS) on a contradiction graph while prioritizing high-reliability documents, providing provable robustness guarantees alongside high accuracy on benign scenarios and long-form generation tasks.
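The selection idea in the ReliabilityRAG entry, a maximum independent set on a contradiction graph with preference for reliable documents, can be sketched for small retrieval sets. This is an illustrative brute-force version under assumptions: documents are indexed by rank (0 is the most reliable), contradictions are given as index pairs, and ties between equally large sets are broken toward lower indices. The paper's actual search procedure and tie-breaking rule are not described in this summary.

```python
from itertools import combinations

def select_consistent_docs(num_docs, contradictions):
    """Return a largest set of mutually non-contradicting document indices.

    Subsets are tried from largest to smallest; within one size,
    itertools.combinations yields them in lexicographic order, so the first
    consistent subset found favors lower (more reliable) indices.
    Exponential time: only suitable for a handful of retrieved documents.
    """
    conflict = {frozenset(pair) for pair in contradictions}
    for size in range(num_docs, 0, -1):
        for subset in combinations(range(num_docs), size):
            if all(frozenset(pair) not in conflict
                   for pair in combinations(subset, 2)):
                return list(subset)
    return []

# Five ranked documents; document 3 contradicts 0 and 1, and 4 contradicts 2.
edges = [(0, 3), (1, 3), (2, 4)]
print(select_consistent_docs(5, edges))
```

Only the selected documents would then be passed to the generator, so a few contradicting injected documents cannot outvote the consistent, higher-ranked majority.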
Reverse Engineering Human Preferences with Reinforcement Learning: A reinforcement learning-trained preamble generator is used to inflate the evaluation scores of downstream LLMs, exposing critical vulnerabilities in the LLM-as-a-Judge evaluation framework. The attack is nearly undetectable and demonstrates cross-model transferability.
SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders: This paper proposes SAEMark, a framework that leverages sparse autoencoders (SAEs) to extract Feature Concentration Scores (FCS) from text, and embeds multi-bit watermarks via inference-time feature-guided rejection sampling. The approach requires no modification to model weights or logits, natively supports black-box APIs, multilingual text, and code, and achieves state-of-the-art watermark detectability and text quality across English, Chinese, and code domains.
SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations: This paper proposes SECA (Semantically Equivalent and Coherent Attacks), a realistic prompt perturbation framework that elicits LLM hallucinations while preserving semantic equivalence and coherence, achieving higher attack success rates on multiple-choice QA tasks with near-zero semantic errors.
Self-Refining Language Model Anonymizers via Adversarial Distillation: This paper proposes SEAL, a framework that distills GPT-4-level text anonymization capabilities into an 8B model via adversarial distillation, combining SFT + DPO training with a self-refinement mechanism. The resulting small model achieves privacy–utility trade-offs on par with or superior to GPT-4-based anonymizers while enabling fully local deployment.
SIMU: Selective Influence Machine Unlearning: SIMU proposes a two-stage framework: it first identifies critical MLP neurons encoding forget-set information via gradient aggregation, then applies second-order (Sophia) optimization exclusively to those neurons, achieving effective unlearning while substantially preserving the model's original capabilities.
Stop DDoS Attacking the Research Community with AI-Generated Survey Papers: This position paper analogizes the proliferation of AI-generated survey papers to a "Distributed Denial-of-Service (DDoS) attack" on the academic community. Through systematic quantitative analysis of 10,063 CS survey papers on arXiv from 2020 to 2024, the paper documents synchronized post-ChatGPT surges in survey volume, AI-generation scores, and anomalous author counts. It diagnoses four major quality deficiencies in AI-generated surveys (disorganized structure, unoriginal taxonomies, inaccurate citations, and highly redundant content), analyzes cultural repercussions for the researcher–reviewer–editor triad, and proposes a comprehensive response framework encompassing transparency requirements, rigorous review standards, redundancy restrictions, AI-detection assistance, and a "Dynamic Live Survey" platform.
SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications: This paper proposes BIRD-CRITIC (the first SQL debugging benchmark) and the Six-Gym training environment, and develops the Bird-Fixer agent. Through the f-Plan Boosting strategy, it elevates the SQL debugging capability of a 14B open-source model to surpass Claude-3.7-Sonnet and GPT-4.1, achieving efficient SQL issue resolution while preserving data privacy.
Teaming LLMs to Detect and Mitigate Hallucinations: This paper generalizes single-model consistency methods (Self-Consistency + Semantic Entropy) to a multi-model "consortium" setting comprising heterogeneous LLMs. By aggregating responses from models with diverse training backgrounds, the approach breaks the consistent hallucinations that arise within a single model. Evaluating a large number of consortium combinations over a pool of 15 LLMs, the paper finds that well-matched strong-model consortia outperform the strongest single-model baseline in 92% of cases while incurring lower inference cost.
ToxicTextCLIP: Text-Based Poisoning and Backdoor Attacks on CLIP Pre-training: This paper proposes ToxicTextCLIP, a framework that generates high-quality adversarial texts during CLIP pre-training via two modules—Background-aware Target Text Selector and Background-driven Poisoned Text Augmenter—achieving up to 95.83% attack success rate and 98.68% backdoor Hit@1, while successfully bypassing three defenses: RoCLIP, CleanCLIP, and SafeCLIP.
Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties: This paper proposes the Trans-EnV framework, which combines expert linguistic knowledge with the transformation capabilities of LLMs to automatically convert Standard American English (SAE) datasets into 38 English varieties (18 dialects + 20 ESL Englishes), revealing performance degradations of up to 46.3% on non-standard English and highlighting critical linguistic fairness concerns.
TRAP: Targeted Redirecting of Agentic Preferences: TRAP introduces a diffusion-based semantic injection adversarial framework that optimizes image semantics in the CLIP embedding space. Under black-box conditions, it systematically misdirects the decision preferences of multiple mainstream VLM agents in a visually natural manner, achieving attack success rates of up to 100% across six models including LLaVA-34B and GPT-4o.
TRUST -- Transformer-Driven U-Net for Sparse Target Recovery: This paper proposes the TRUST architecture, which integrates the Transformer attention mechanism with a U-Net decoder to jointly learn the sensing operator and reconstruct sparse signals under unknown sensing matrices, achieving significant improvements over conventional methods in SSIM and PSNR.
Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery: This paper proposes reframing machine unlearning as an epistemological probe ("unlearning as ablation"): by systematically removing a target piece of knowledge along with its unlearning closure, and then testing whether a model can re-derive it from axioms, the framework provides a falsifiable test to distinguish whether LLMs genuinely generate new knowledge or merely retrieve memorized fragments.
Virus Infection Attack on LLMs: Your Poisoning Can Spread "VIA" Synthetic Data: This paper presents the first systematic study of security risks introduced by synthetic data in LLM training. It reveals that existing poisoning and backdoor attacks rarely propagate through synthetic data, and proposes the Virus Infection Attack (VIA) framework. VIA embeds poisoned content into normal training samples via hijacking point search and shell construction, enabling malicious content to be generated by the model even on clean queries and subsequently propagated to downstream models.
When AI Democratizes Exploitation: LLM-Assisted Strategic Manipulation of Fair Division Algorithms: This paper empirically demonstrates that LLMs can reduce algorithm manipulation in fair division—previously requiring deep expertise in mechanism design—to a simple natural language conversation available to any user. Four coordination scenarios are designed on the Spliddit fair rent platform (exclusionary collusion, defensive counter-attack, benevolent collusion, and cost-minimization coalition), fundamentally overturning the traditional assumption that "algorithmic complexity serves as a security barrier."

📈 Time Series¶

A Graph Neural Network Approach for Localized and High-Resolution Temperature Forecasting: This paper proposes a GCN-GRU hybrid framework for community-scale (2.5 km) high-resolution temperature forecasting (1–48 hours), validated across three regions in southwestern Ontario, Canada. The largest region achieves an average MAE of 1.93°C and a 48-hour MAE of 2.93°C. The work explores ClimateBERT language model embeddings as a standardized input scheme, and provides a transferable lightweight forecasting framework targeting data-scarce regions in the Global South.
Abstain Mask Retain Core: Time Series Prediction by Adaptive Masking Loss with Representation Consistency: This paper reveals a counter-intuitive phenomenon in time series forecasting — that appropriately truncating historical inputs can improve prediction accuracy (termed the redundant feature learning problem) — and proposes AMRC based on information bottleneck theory. AMRC suppresses redundant feature learning via adaptive masking loss and representation consistency constraints, serving as a model-agnostic training framework that consistently improves performance across diverse architectures.
AERO: A Redirection-Based Optimization Framework Inspired by Judo for Robust Probabilistic Forecasting: AERO proposes an optimization paradigm inspired by the judo principle of "redirecting force rather than resisting it," attempting to redirect adversarial perturbations into beneficial optimization signals. The framework is theoretically grounded in 15 axioms and 4 theorems, constructing an energy-conservation-based gradient redirection system. However, the actual implementation is substantially simplified to momentum SGD with Gaussian noise injection, and validation is conducted solely on a single private solar energy price prediction dataset without any baseline comparisons.
AttentionPredictor: Temporal Patterns Matter for KV Cache Compression: AttentionPredictor is the first learning-based method that directly predicts attention patterns for KV cache compression and critical token identification. By leveraging a lightweight CNN to capture spatiotemporal patterns in attention scores, it achieves 13× KV cache compression and 5.6× inference speedup, with a unified prediction model of only 21 KB shared across all Transformer layers.
Benchmarking Probabilistic Time Series Forecasting Models on Neural Activity: This paper presents the first systematic evaluation of 12 probabilistic time series forecasting models on mouse cortical calcium imaging data. PatchTST consistently achieves top performance (informative prediction horizon up to 1.5 s); zero-shot foundation models (Chronos) fail entirely but become competitive after fine-tuning; and the intrinsic predictability ceiling of neural activity is found to be approximately 1.5 seconds.
BubbleFormer: Forecasting Boiling with Transformers: This paper proposes BubbleFormer, a Transformer architecture based on decomposed spatiotemporal attention for forecasting boiling dynamics—including the notoriously difficult spontaneous bubble nucleation events—accompanied by the BubbleML 2.0 dataset (160+ high-fidelity simulations), achieving accurate spatiotemporal boiling predictions across diverse fluids, geometries, and wall conditions.
Causal Masking on Spatial Data: An Information-Theoretic Case for Learning Spatial Datasets with Unimodal Language Models: This paper demonstrates that applying causal masking directly to spatial data (chess board states in FEN format) for training a unimodal LLM outperforms first linearizing the data into sequences (PGN move records) and then applying causal masking — Llama 1.3B trained with FEN + causal masking achieves ~2630 Elo, whereas PGN + causal masking yields only ~2130 Elo.
CausalDynamics: A Large-Scale Benchmark for Structural Discovery of Dynamical Causal Models: This paper introduces CausalDynamics — the largest benchmark to date for causal discovery in dynamical systems (14,000+ graphs, 50M+ samples) — encompassing a three-tier progressively complex hierarchy ranging from 3-dimensional chaotic ODE/SDE systems and hierarchically coupled systems to realistic climate models. The benchmark comprehensively evaluates 10 state-of-the-art causal discovery algorithms, revealing the shortcomings of current deep learning methods on high-dimensional nonlinear dynamical systems.
Channel Matters: Estimating Channel Influence for Multivariate Time Series: This paper proposes Channel-wise Influence (ChInf)—the first influence function method capable of quantifying the effect of individual channels on model performance in multivariate time series (MTS). By decomposing TracIn from the holistic sample level to the channel level, ChInf enables two downstream applications: channel-level anomaly detection and channel pruning, achieving state-of-the-art performance on 5 anomaly detection benchmarks.
Connecting the Dots: A Machine Learning Dataset for Ionospheric Prediction: This paper constructs an open, ML-ready ionospheric prediction dataset that integrates 8 heterogeneous data sources (solar observations, geomagnetic indices, TEC maps, etc.) spanning approximately 14 years (2010–2024). Three spatiotemporal baseline models—LSTM, SFNO, and GraphCast—are trained on this dataset, achieving TEC forecasts with lead times up to 12 hours.
Decomposition of Small Transformer Models: This paper extends Stochastic Parameter Decomposition (SPD) to Transformers by designing a sequence-aware causal importance function and a novel partial reconstruction loss. On a toy induction head task, the method recovers the expected two-step circuit; on GPT-2-small, it localizes rank-1 parameter subspaces corresponding to interpretable concepts such as "golf" and "basketball."
DemandCast: Global hourly electricity demand forecasting: DemandCast is an open-source machine learning framework that leverages XGBoost to integrate historical electricity demand, ERA5 temperature data, and socioeconomic features for hourly electricity demand forecasting across 56 countries/regions worldwide. By normalizing the target variable as a fraction of annual demand, the framework achieves cross-country comparability and attains a MAPE of 9.2% on a temporally held-out test set.
Diffusion Transformers as Open-World Spatiotemporal Foundation Models: This paper proposes UrbanDiT, the first open-world urban spatiotemporal foundation model based on Diffusion Transformers. It integrates heterogeneous data types (grid/graph) and diverse tasks (prediction, interpolation, extrapolation, imputation) through a unified prompt learning framework, achieving state-of-the-art performance across multiple cities and scenarios while demonstrating strong zero-shot generalization.
Diffusion Transformers for Imputation: Statistical Efficiency and Uncertainty Quantification: This paper analyzes the sample complexity and uncertainty quantification performance of conditional diffusion Transformers (DiT) for time series imputation from a statistical learning perspective, and proposes a mixed-masking training strategy to improve imputation quality.
Exploring Neural Granger Causality with xLSTMs: Unveiling Temporal Dependencies in Complex Data: This paper proposes GC-xLSTM, which leverages the xLSTM architecture combined with a novel dynamic sparsity optimization strategy to uncover Granger causal relationships in multivariate time series, achieving state-of-the-art performance on multiple datasets.
Feature-aware Modulation for Learning from Temporal Tabular Data: This paper addresses distribution shift in temporal tabular data by proposing a feature-aware temporal modulation mechanism. Through learnable transformations conditioned on temporal context, it dynamically adjusts per-feature shift ($\beta$), scale ($\gamma$), and skewness ($\lambda$) to align feature semantics across time. On the TabReD benchmark, it is the first approach to enable deep learning methods to systematically outperform GBDT.
Frequency Matters: When Time Series Foundation Models Fail Under Spectral Shift: This paper identifies spectral shift—a mismatch between the dominant frequencies of downstream data and those covered by pretraining data—as the key reason for generalization failure of Time Series Foundation Models (TSFMs) in industrial settings. The hypothesis is validated through an industrial-scale mobile game player engagement prediction task and controlled synthetic experiments.
Fern: Chaining Spectral Pearls — Ellipsoidal Forecasting Beyond Trajectories for Time Series: This paper proposes Fern (Forecasting with Ellipsoidal RepresentatioN), which replaces conventional trajectory prediction with patch-wise ellipsoidal transport (rotation–scaling–translation). Fern substantially outperforms baselines on chaotic systems while remaining competitive on standard LTSF benchmarks.
How Foundational are Foundation Models for Time Series Forecasting?: Through systematic experiments on synthetic and real-world electricity consumption data, this paper reveals that the zero-shot generalization capability of time series foundation models (TSFMs) is highly dependent on the pretraining data distribution. Under domain shift, SAMFormer—a lightweight specialized model with only 49.5K parameters trained from scratch—outperforms fine-tuned TimesFM with 500M+ parameters.
How Patterns Dictate Learnability in Sequential Data: This paper proposes an information-theoretic framework based on predictive information $\mathbf{I}(X_{\text{past}}; X_{\text{future}})$ to quantify the strength of temporal patterns in sequential data. It derives theoretical bounds linking predictive information to the minimum achievable risk, thereby enabling a distinction between "insufficient model capacity" and "intrinsically unpredictable data."
Human-Machine Ritual: Synergic Performance through Real-Time Motion Recognition: This paper proposes a lightweight real-time motion recognition system that leverages wearable IMU sensors combined with the MiniRocket time-series classifier to achieve dancer-specific motion recognition with <50ms latency and 96.05% accuracy. Through "embodied memory mapping," the system encodes each dancer's personal movement-sound associations, establishing a human-machine collaborative performance paradigm that respects the expressive depth of the human body.
Improving Time Series Forecasting via Instance-aware Post-hoc Revision (PIR): PIR proposes an instance-aware post-hoc revision framework that identifies poorly predicted instances via uncertainty estimation and applies a residual combination of local correction (covariate + exogenous variable Transformer) and global correction (retrieval-based weighted average over similar training instances) as a plug-and-play module, reducing SparseTSF MSE by 25.87% and PatchTST MSE by 8.99%.
In-Context Learning of Stochastic Differential Equations with Foundation Inference Models: This paper proposes FIM-SDE, a pretrained recognition model capable of zero-shot (in-context) estimation of drift and diffusion functions of low-dimensional SDEs from noisy time series data, and further surpasses all baseline methods via rapid fine-tuning.
IonCast: A Deep Learning Framework for Forecasting Ionospheric Dynamics: This paper proposes IonCast, a GraphCast-inspired graph neural network framework that integrates multi-source heterogeneous physics-driven data to achieve high-accuracy spatiotemporal forecasting of global Total Electron Content (TEC).
IonCast: A Deep Learning Framework for Forecasting Ionospheric Dynamics: This paper proposes IonCast, a framework comprising a GraphCast-based GNN model and a ConvLSTM baseline that integrates multi-source heterogeneous space weather data (TEC maps, solar wind, geomagnetic indices, orbital mechanics, etc.) for global spatiotemporal forecasting of ionospheric total electron content (TEC). IonCast outperforms persistence baselines and the IRI empirical model under geomagnetic storm conditions.
Learning Time-Scale Invariant Population-Level Neural Representations: This paper proposes Time-Scale Augmented Pretraining (TSAP), a strategy that introduces data augmentation over multiple temporal window lengths during pretraining, enabling population-level neural signal foundation models to achieve invariance to input time scales and substantially improving decoding performance at both matched and unseen time scales.
Learning with Calibration: Exploring Test-Time Computing of Spatio-Temporal Forecasting: This paper proposes ST-TTC, a lightweight test-time computing paradigm that corrects periodic biases in spatio-temporal forecasting during inference via a frequency-domain phase-amplitude calibrator and a flash gradient update mechanism, consistently improving the performance of diverse backbone models without modifying their architectures.
Less is More: Unlocking Specialization of Time Series Foundation Models via Structured Pruning: This paper reveals that pretrained time series foundation models (TSFMs) exhibit inherent task-relevant sparsity, and proposes a Prune-then-Finetune paradigm—removing task-irrelevant parameters via structured pruning so that a pruned-then-finetuned smaller model significantly outperforms direct fine-tuning of the full model, and even surpasses strong specialized baselines.
MAESTRO: Adaptive Sparse Attention and Robust Learning for Multimodal Dynamic Time Series: This paper proposes MAESTRO, a framework that addresses modality heterogeneity and arbitrary missingness in multimodal time series via symbolic tokenization, adaptive attention budgeting, sparse cross-modal attention, and dynamic MoE routing, substantially outperforming baselines under both complete and missing modality scenarios.
Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning: This paper proposes the Martingale Score as an unsupervised metric that quantifies belief entrenchment in LLM reasoning processes based on the martingale property from Bayesian statistics. The study finds that belief entrenchment is pervasive across models and domains, and is significantly correlated with degraded accuracy.
MASFIN: A Multi-Agent System for Decomposed Financial Reasoning and Forecasting: This paper proposes MASFIN, a multi-agent system that decomposes financial forecasting into multiple sub-tasks (macroeconomic analysis, industry analysis, technical analysis, sentiment analysis, etc.), with specialized LLM agents collaborating to produce more accurate and interpretable financial predictions than single-model approaches.
Multi-Scale Finetuning for Encoder-based Time Series Foundation Models: This paper proposes MSFT (Multi-Scale FineTuning), which leverages causal analysis to reveal that naive fine-tuning suffers from scale confounding, and designs a multi-scale modeling framework for efficient fine-tuning of encoder-based time series foundation models, significantly outperforming both naive fine-tuning and from-scratch SOTA methods.
Neural MJD: Neural Non-Stationary Merton Jump Diffusion for Time Series Prediction: This paper proposes Neural MJD, which parameterizes a non-stationary Merton Jump Diffusion model via neural networks, casting prediction as an SDE simulation problem. The framework combines a time-varying Itô diffusion (capturing continuous drift) with a time-varying compound Poisson process (modeling abrupt jumps), and employs likelihood truncation together with an Euler-Maruyama with Restart solver to enable scalable learning and inference.
NSW-EPNews: A News-Augmented Benchmark for Electricity Price Forecasting with LLMs: This paper introduces NSW-EPNews, the first electricity price forecasting benchmark augmented with news text, systematically evaluating both traditional models and LLMs on multimodal electricity price prediction. Key findings show that news features provide marginal gains for traditional models, while LLMs suffer from severe hallucination issues.
Parallelization of Non-linear State-Space Models: Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling: This paper proposes LrcSSM, which achieves exact and efficient parallelization of nonlinear RNNs by constraining the Jacobian matrix of Liquid-Resistance Liquid-Capacitance (LRC) networks to be diagonal, surpassing Transformer, LRU, S5, and Mamba on long-sequence classification benchmarks.
Physics-informed Reduced Order Modeling of Time-dependent PDEs via Differentiable Solvers: This paper proposes Φ-ROM, a framework that embeds differentiable PDE solvers into the training loop of nonlinear reduced order models. By leveraging solver feedback to directly constrain latent space dynamics, Φ-ROM significantly outperforms purely data-driven ROMs and other physics-informed methods in generalization to unseen parameters/initial conditions, long-horizon extrapolation, and solution recovery from sparse observations.
PlanU: Large Language Model Reasoning through Planning under Uncertainty: This paper proposes PlanU—an LLM decision-making method that models node returns via quantile distributions within MCTS and balances exploration and exploitation through an Upper Confidence Bounds with Curiosity (UCC) score. PlanU is the first approach to systematically and simultaneously address both LLM uncertainty and environmental uncertainty, achieving substantial improvements over existing methods across multiple stochastic environment benchmarks.
Power Ensemble Aggregation for Improved Extreme Event AI Prediction: This paper proposes an adaptive ensemble aggregation method based on the power mean. By applying nonlinear aggregation (power exponent $p>1$) to the scores of ensemble members from generative weather prediction models, the method significantly improves classification performance for extreme high-temperature events, with greater gains at higher quantile thresholds.
Probability Calibration for Precipitation Nowcasting: This paper proposes the Expected Threshold Calibration Error (ETCE) as a more appropriate metric for probability calibration in precipitation nowcasting, and extends post-hoc calibration techniques from computer vision to the forecasting domain. By incorporating a lead-time-conditioned Selective Scaling method, the proposed approach reduces model calibration error by up to 23.5%.
RiverMamba: A State Space Model for Global River Discharge and Flood Forecasting: RiverMamba is the first deep learning model capable of 7-day river discharge forecasting on a 0.05° (~5.5 km) global grid. Global grid points are serialized via space-filling curves into 3D spatiotemporal point sequences and fed into bidirectional Mamba blocks driven by ECMWF HRES meteorological forecasts. The model achieves F1 = 0.459 on flood detection across 1.5–500-year return periods, surpassing LSTM (0.358) and the physical model GloFAS.
Rotary Masked Autoencoders are Versatile Learners: This paper proposes RoMAE, which extends Rotary Position Embedding (RoPE) to continuous positions and integrates it with Masked Autoencoders (MAE). Without any time-series-specific architectural modifications, RoMAE matches or surpasses specialized models across diverse modalities including irregular time series, images, and audio.
Scalable Signature Kernel Computations for Long Time Series via Local Neumann Series Expansions: This paper proposes PowerSig, which efficiently computes signature kernels via locally adaptive truncated Neumann series expansions, reducing memory from $O(\ell^2)$ to $O(\ell P)$ and enabling signature kernel computation on time series of length exceeding one million on a single GPU.
ScatterAD: Temporal-Topological Scattering Mechanism for Time Series Anomaly Detection: This paper proposes scattering as a novel inductive bias for anomaly detection — anomalous samples are more dispersed than normal samples in the high-dimensional representation space. A dual-encoder architecture (temporal + topological) combined with hyperspherical scattering center constraints and contrastive fusion is used to learn joint temporal-topological representations, achieving best performance in 15/24 settings across 6 industrial IoT datasets.
Selective Learning for Deep Time Series Forecasting: This paper proposes a Selective Learning strategy that employs a dual-mask mechanism—comprising an uncertainty mask and an anomaly mask—to identify generalizable time steps for MSE loss computation. The approach achieves an average MSE reduction of 37.4% for Informer, 8.4% for TimesNet, and 6.5% for iTransformer across 8 benchmark datasets.
SEMPO: Lightweight Foundation Models for Time Series Forecasting: This paper proposes SEMPO — a lightweight time series foundation model with only 6.5M parameters pretrained on 83M time points — that combines energy-aware spectral decomposition with a mixture-of-prompts Transformer to surpass large foundation models with over 100× more parameters in zero-shot and few-shot forecasting.
Simple and Efficient Heterogeneous Temporal Graph Neural Network: This paper proposes SE-HTGNN, which integrates temporal modeling into spatial learning via a dynamic attention mechanism and initializes attention coefficients using LLM-generated priors, achieving up to 10× speedup over prior methods while maintaining state-of-the-art predictive accuracy on heterogeneous temporal graph tasks.
Statistical Guarantees for High-Dimensional Stochastic Gradient Descent: This work introduces coupling techniques from high-dimensional nonlinear time series into online learning, providing the first rigorous moment convergence bounds and high-probability concentration inequalities—under $\ell^s$ and $\ell^\infty$ norms—for constant learning rate SGD and its Ruppert–Polyak averaged variant (ASGD) in high dimensions.
StRap: Spatio-Temporal Pattern Retrieval for Out-of-Distribution Generalization: This paper proposes StRap, a framework that constructs a multi-dimensional pattern memory bank comprising spatial, temporal, and spatio-temporal key-value pairs. At inference time, StRap retrieves historical patterns most similar to the current input and adaptively fuses them into the model representation, effectively addressing the Spatio-Temporal Out-Of-Distribution (STOOD) problem in streaming spatio-temporal data.
Structured Temporal Causality for Interpretable Multivariate Time Series Anomaly Detection: This paper proposes OracleAD, a framework that learns causal embeddings for each variable (via LSTM encoding and attention pooling) and constructs a Stable Latent Structure (SLS) to model inter-variable relationships under normal conditions. A dual scoring mechanism combining prediction error and SLS deviation enables interpretable multivariate time series anomaly detection and root cause localization.
Synthetic Series-Symbol Data Generation for Time Series Foundation Models: This paper proposes the Series-Symbol (S²) data generation mechanism and SymTime, a dual-modality foundation model. Grounded in Takens' theorem and symbolic dynamics theory, the framework generates unlimited synthetic time series–symbol paired data (40M pairs / 50B tokens). Through cross-modal contrastive pre-training, SymTime achieves performance competitive with models pre-trained on real data across five time series tasks.
SynTSBench: Rethinking Temporal Pattern Learning in Deep Learning Models for Time Series: This paper proposes SynTSBench, a synthetic data-driven evaluation paradigm that systematically assesses the actual modeling capabilities of time series forecasting models across dimensions such as trend, periodicity, dependency, and noise robustness, through programmable feature configurations and theoretically optimal benchmarks.
The Human Brain as a Combinatorial Complex: This paper proposes a data-driven framework that constructs Combinatorial Complexes (CCs) directly from fMRI time series using information-theoretic measures—namely S-information and O-information—encoding higher-order synergistic interactions among brain regions into topological structures, thereby laying the groundwork for applying topological deep learning to brain network analysis.
Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series: This work constructs Time-IMM — the first multimodal multivariate time series benchmark that categorizes irregularity according to causal mechanisms (9 irregularity types organized into three classes: Trigger, Constraint, and Artifact, spanning 9 datasets). An accompanying forecasting library, IMM-TSF, supports asynchronous multimodal fusion. Experiments demonstrate that explicitly modeling multimodal information reduces MSE by 6.71% on average across irregular time series settings, with a maximum improvement of 38.38%.
Time-O1: Time-Series Forecasting Needs Transformed Label Alignment: This paper proposes Time-O1, which addresses the autocorrelation bias and task overload of the TMSE loss in time series forecasting by transforming label sequences into decorrelated, importance-ranked principal components. The method achieves state-of-the-art performance while remaining compatible with a wide range of forecasting models.
TimePerceiver: An Encoder-Decoder Framework for Generalized Time-Series Forecasting: TimePerceiver proposes a unified encoder-decoder framework that generalizes the forecasting task (encompassing extrapolation, interpolation, and imputation) and employs a latent bottleneck encoder with a query-based decoder, achieving comprehensive state-of-the-art performance across 8 standard benchmarks.
Transformer Embeddings for Fast Microlensing Inference: This paper combines a Transformer encoder with Neural Posterior Estimation (NPE) to perform fast, well-calibrated parameter inference directly from sparse, noisy, and irregularly sampled microlensing light curves, achieving speedups exceeding $10^4\times$ over traditional MCMC methods.
Universal Spectral Tokenization via Self-Supervised Panchromatic Representation Learning: This paper proposes the first universal spectral tokenizer that jointly trains on heterogeneous astronomical spectra (SDSS/DESI/GALAH/APOGEE) on their native wavelength grids via continuous wavelength embeddings and self-supervised reconstruction objectives, producing aligned, uniform, and physically meaningful representations.
WaLRUS: Wavelets for Long-range Representation Using SSMs: This paper proposes WaLRUS, a state space model (SSM) built upon Daubechies wavelets as a novel instantiation of the SaFARi framework, expanding the diversity of the SSM family and demonstrating unique advantages in long-range dependency modeling.
Wavelet Canonical Coherence for Nonstationary Signals: This paper proposes WaveCanCoh, a framework that extends classical canonical coherence analysis to the wavelet domain. Built upon the multivariate locally stationary wavelet (MvLSW) model, it enables estimation of time-varying, scale-specific canonical coherence between two groups of nonstationary multivariate time series.
xLSTM-Mixer: Multivariate Time Series Forecasting by Mixing via Scalar Memories: This paper proposes xLSTM-Mixer, the first architecture to combine sLSTM, the scalar-memory variant of the Extended Long Short-Term Memory network (xLSTM), with a Mixer framework. Through a three-stage design comprising temporal mixing, joint temporal-variate mixing, and multi-view mixing, the model achieves state-of-the-art performance on multivariate long-term time series forecasting while maintaining an extremely low memory footprint.
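
As a rough illustration of the power-mean aggregation mentioned in the Power Ensemble Aggregation entry above, the sketch below applies a generic power mean to per-member scores. The function name `power_mean`, the example scores, and the choice $p = 4$ are assumptions made for illustration; the paper's actual scoring and its adaptive choice of $p$ are not reproduced here.

```python
import numpy as np


def power_mean(scores, p=1.0):
    """Generic power mean over ensemble members (axis 0).

    Assumes non-negative scores and p != 0. With p = 1 this is the
    arithmetic mean; p > 1 gives more weight to the largest members.
    """
    scores = np.asarray(scores, dtype=float)
    if p == 0:
        raise ValueError("p must be non-zero for this simple sketch")
    return np.mean(scores ** p, axis=0) ** (1.0 / p)


# Hypothetical scores: 4 ensemble members (rows) at 2 locations (columns).
# At location 0 a single member signals an extreme event; at location 1
# all members agree.
member_scores = np.array([
    [0.1, 0.5],
    [0.1, 0.5],
    [0.1, 0.5],
    [0.9, 0.5],
])

arithmetic = power_mean(member_scores, p=1.0)
emphasized = power_mean(member_scores, p=4.0)
print("p=1:", arithmetic)
print("p=4:", emphasized)
```

With $p > 1$ the aggregate at location 0 moves toward the single high-scoring member, while the aggregate at location 1, where all members agree, is unchanged.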

📹 Video Understanding

A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers: This paper proves that increasing Transformer depth from a constant to $\Theta(\log n)$ unlocks the ability to recognize regular languages and solve graph connectivity, two problems provably beyond the reach of fixed-depth Transformers. It also shows that depth scaling is strictly more efficient than width scaling (which requires super-polynomial growth) or adding Chain-of-Thought (CoT) steps (which requires super-logarithmic growth).
AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding: AdaVideoRAG routes each query to one of three retrieval pathways (no retrieval / naive retrieval / graph retrieval) via a lightweight intent classifier and combines this routing with an omni-knowledge indexing module (caption + ASR + OCR + visual + knowledge graph) to improve the efficiency–accuracy trade-off in long video understanding, yielding a 39.8% improvement for Qwen2.5-VL-7B on MLVU.
Adversarial Locomotion and Motion Imitation for Humanoid Policy Learning: ALMI proposes an upper-lower body adversarial training framework: the lower-body policy learns robust locomotion under upper-body motion perturbations, while the upper-body policy learns precise motion imitation under lower-body locomotion perturbations. Through iterative adversarial training converging to a Nash equilibrium, the framework enables stable whole-body coordinated control on the Unitree H1-2 real robot.
Agentic Persona Control and Task State Tracking for Realistic User Simulation: A three-agent collaborative framework for realistic user simulation is proposed, comprising a User Agent (coordination), a State Tracking Agent (structured task state), and a Message Attributes Generation Agent (behavior attribute control conditioned on persona and state). On a restaurant ordering scenario, the framework achieves a 102.6% improvement in composite realism score (CRRS), +19.9% in persona adherence, and +284.5% in behavioral variability. A core finding is that behavior control without state awareness yields BVS = 0 (completely rigid behavior).
Cloud4D: Estimating Cloud Properties at a High Spatial and Temporal Resolution: The first learning framework based on ground-level multi-view cameras that reconstructs four-dimensional (3D spatial + temporal) cloud liquid water content distributions via a homography-guided 2D-to-3D Transformer. The method achieves less than 10% error relative to radar at 25 m spatial and 5 s temporal resolution, improving spatiotemporal resolution by an order of magnitude over satellite observations.
ConViS-Bench: Estimating Video Similarity Through Semantic Concepts: This paper introduces ConViS, a concept-based video similarity estimation task, along with its accompanying benchmark ConViS-Bench (610 video pairs, 16 domains, 5 concepts). It systematically evaluates 10+ mainstream models on concept-conditioned video comparison, revealing significant deficiencies in current models' understanding of temporal structure and spatial context.
DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products: This paper proposes DeltaProduct, which extends DeltaNet's single-step gradient descent to $n_h$-step gradient descent per token, yielding a state transition matrix expressed as a product of $n_h$ generalized Householder transformations. This achieves a tunable trade-off between expressivity and efficiency, significantly improving state-tracking capability and length extrapolation performance.
Dense SAE Latents Are Features, Not Bugs: This paper systematically investigates frequently activating "dense latents" in sparse autoencoders (SAEs), demonstrating that they are not training artifacts but rather reflections of intrinsically dense subspaces in language model residual streams. The authors propose a six-category taxonomy of dense latents encompassing position tracking, context binding, null space, alphabetic, part-of-speech, and PCA latents.
Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition: This paper proposes DANCE, a framework that achieves structured and motion-aware explainable video action recognition by disentangling action explanations into three concept types: motion dynamics, objects, and scenes.
DSAS: A Universal Plug-and-Play Framework for Attention Optimization in Multi-Document Question Answering: This paper proposes Dual-Stage Adaptive Sharpening (DSAS), a training-free plug-and-play attention optimization framework. It employs Contextual Gate Weighting (CGW) to enhance attention from key passages toward the question and target positions, and Reciprocal Attention Suppression (RAS) to suppress information exchange between key and irrelevant passages, achieving an average F1 improvement of 4.2% on multi-document QA benchmarks.
DualGround: Structured Phrase and Sentence-Level Temporal Grounding: This paper identifies that existing video temporal grounding (VTG) models over-rely on the global sentence semantics encoded in the [EOS] token while neglecting word-level signals. It proposes DualGround, a dual-branch architecture that explicitly decouples global and local semantics via a sentence-level path (adaptive cross-attention) and a phrase-level path (recurrent phrase generation + Slot Attention), achieving state-of-the-art performance on QVHighlights and Charades-STA.
egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks: This paper introduces egoEMOTION — the first dataset combining egocentric vision (Meta Project Aria glasses) with physiological signals for emotion and personality recognition. It encompasses 43 participants, 50+ hours of recordings, and 16 tasks, and demonstrates that egocentric vision signals (particularly eye-tracking features) outperform conventional physiological signals for emotion prediction in real-world scenarios.
EgoGazeVQA: Egocentric Gaze-Guided Video Question Answering Benchmark: This paper introduces EgoGazeVQA, the first egocentric video question answering benchmark that incorporates user eye-gaze data. Through gaze-guided prompting strategies (textual, visual, and salience map), the benchmark demonstrates substantial improvements in MLLMs' ability to understand user intent. The Gaze Salience Map strategy raises MiniCPM-o's accuracy from 35.9% to 53.7%.
Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding: DualGround identifies a critical issue in existing VTG models — over-reliance on the global semantics of the [EOS] token while neglecting word-level signals — and proposes a sentence-level + phrase-level dual-path architecture. Through an Adaptive Cross-Attention (ACA) module and a Recurrent Phrase Generator (RPG), the model captures global and local semantics respectively, achieving state-of-the-art performance on QVHighlights and Charades-STA.
Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders: This paper proposes STAVEQ2, which inserts parameter-efficient Stacked Temporal Attention (STA) modules into the Vision Encoder to address fundamental architectural deficiencies in existing Video-LLMs for fine-grained temporal understanding (e.g., distinguishing "pulling from left to right" vs. "pulling from right to left"), achieving up to 5.5% improvement on VITATECS/MVBench/Video-MME.
FastVID: Dynamic Density Pruning for Fast Video Large Language Models: This paper proposes FastVID, which systematically eliminates video token redundancy along both temporal and visual dimensions via Dynamic Temporal Segmentation (DySeg) and Density Spatiotemporal Pruning (STPrune). On LLaVA-OneVision-7B, FastVID retains 98% accuracy after pruning 90.3% of video tokens, achieving a 7.1× speedup in the LLM prefill stage.
Fixed-Point RNNs: Interpolating from Diagonal to Dense: This paper proposes the Fixed-Point RNN framework, which parameterizes dense linear RNNs as fixed points of diagonal linear RNNs. By varying the number of iterations, the model dynamically interpolates between diagonal (efficient) and dense (expressive) regimes, achieving state-of-the-art results simultaneously on state-tracking ($A_5$/$S_5$) and copying tasks for the first time.
Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition: This paper proposes a cross-attention multimodal architecture that integrates V-JEPA 2 visual context features with CoMotion 3D skeletal pose data, outperforming unimodal baselines on standard and high-occlusion action recognition benchmarks.
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding: This paper proposes InfiniPot-V, the first training-free and query-agnostic streaming video understanding framework. It achieves online KV cache compression via two complementary metrics — Temporal-axis Redundancy (TaR) and Value-Norm (VaN) — enabling streaming video understanding of arbitrary length under a fixed memory budget.
InFlux: A Benchmark for Self-Calibration of Dynamic Intrinsics of Video Cameras: This paper introduces InFlux, the first real-world video benchmark with per-frame ground-truth dynamic camera intrinsics (386 videos, 143K+ annotated frames). Accurate annotations are achieved via a lookup table (LUT) mapping lens metadata to intrinsic parameters. The benchmark reveals that existing intrinsic prediction methods perform poorly under dynamic intrinsic settings.
INST-IT: Boosting Instance Understanding via Explicit Visual Prompt Instruction Tuning: This work presents the complete Inst-IT framework: a GPT-4o-assisted automatic annotation pipeline for generating instance-level fine-grained data, an Inst-IT Bench evaluation benchmark, a 335K QA-pair instruction tuning dataset, and a continual fine-tuning paradigm that effectively enhances instance-level understanding in LMMs while also improving general image and video comprehension.
KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills: This paper proposes the PBHC framework, which enables a humanoid robot (Unitree G1) to learn highly dynamic whole-body skills such as kung fu and dance through a physics-aware motion processing pipeline and a bi-level optimization scheme for adaptive tracking factors. The approach achieves substantially lower tracking errors than existing methods and is successfully deployed on real hardware.
Lattice Boltzmann Model for Learning Real-World Pixel Dynamicity: Inspired by the Lattice Boltzmann Method from fluid dynamics, this work proposes LBM (Lattice Boltzmann Model) for online real-time pixel tracking. It models video pixels as fluid lattices and solves motion states via collision-streaming processes, achieving SOTA online tracking performance with 18M parameters while enabling real-time inference on edge devices.
Less is More: Local Intrinsic Dimensions of Contextual Language Models: This paper proposes using the Local Intrinsic Dimension (LID) of contextual token embeddings as an unsupervised signal for monitoring LLM training dynamics — a decrease in LID indicates improved generalization, while an increase signals overfitting. The utility of this geometric signal is validated on tasks including dialogue state tracking, grokking, and sentiment recognition.
LiveStar: Live Streaming Assistant for Real-World Online Video Understanding: This paper proposes LiveStar, an always-on live streaming video understanding assistant that achieves adaptive response timing via a Streaming Causal Attention Masks (SCAM) training strategy and a Streaming Verification Decoding (SVeD) inference framework, improving semantic correctness by 19.5% and reducing temporal deviation by 18.1% on the OmniStar benchmark.
MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments: This paper proposes the MEMTRACK benchmark to evaluate LLM agents' long-term memory and state tracking capabilities in multi-platform dynamic environments (Slack/Linear/Git), revealing that even the strongest model, GPT-5, achieves only 60% accuracy.
MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models: This work introduces MimeQA, the first nonverbal social reasoning benchmark built on mime performance videos. It comprises 101 videos and 806 QA pairs organized across three hierarchical question levels (grounding the imagined → scene-level understanding → global reasoning), and reveals a severe gap between current VideoLLMs and humans on nonverbal social understanding (20–30% vs. 86%).
MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence: This paper introduces MUVR, a benchmark for multi-modal untrimmed video retrieval targeting real-world long-video platforms. It proposes a video-centric multi-modal query format (video + text + tag + mask) and a six-level visual correspondence matching criterion, comprising 53K videos and 1,050 queries, and systematically evaluates the limitations of retrieval models and MLLMs.
Neural Stochastic Flows: Solver-Free Modelling and Inference for SDE Solutions: This paper proposes Neural Stochastic Flows (NSF), which directly learns the transition distribution $p(x_t \mid x_s)$ of an SDE via conditional normalising flows. The architecture is constrained to satisfy stochastic flow properties (identity, Markov, Chapman-Kolmogorov), enabling single-step sampling without numerical solvers and achieving up to two orders of magnitude speedup at distant time points.
NeuroPath: Neurobiology-Inspired Path Tracking and Reflection for Semantically Coherent Retrieval: Inspired by the hippocampal place cell navigation and memory consolidation mechanisms in neurobiology, this paper proposes NeuroPath—a RAG framework based on semantic path tracking—that achieves average improvements of 16.3% in recall@2 and 13.5% in recall@5 on multi-hop QA tasks through LLM-driven goal-directed path construction and a post-retrieval completion strategy.
Open-World Drone Active Tracking with Goal-Centered Rewards: This paper introduces DAT, the first open-world drone active tracking benchmark comprising 24 city-scale scenes with high-fidelity dynamics simulation, along with GC-VAT, a reinforcement learning tracking method based on goal-centered rewards and curriculum learning, achieving approximately 72% tracking success rate in simulation.
PASS: Path-Selective State Space Model for Event-Based Recognition: PASS proposes the Path-selective Event Aggregation and Scan (PEAS) module and the Multi-faceted Selection Guiding (MSG) loss, leveraging the linear complexity and frequency generalization capability of SSMs to perform event-based recognition across a broad distribution of event lengths from $10^6$ to $10^9$, while limiting performance degradation to only 8.62% under varying inference frequencies (compared to 20.69% for the baseline).
PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?: By introducing four motion-centric probing techniques and the MoCentric-Bench benchmark, this paper demonstrates that current video multimodal LLMs fail to genuinely exploit motion information in pixel-level visual grounding tasks and can be deceived by static keyframes.
PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling: This paper introduces the Online Audio-Visual Event Parsing (On-AVEP) paradigm for the first time, along with the PreFM framework, which leverages pseudo-future sequences to enhance current contextual understanding. Combined with modality-agnostic knowledge distillation and focal temporal prioritization, PreFM surpasses offline SOTA methods by +9.3 event-level average F1 score using only 2.7% of their parameter count.
QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code: This paper introduces the NeuComBack benchmark for evaluating neural compilation on IR-to-assembly translation tasks, and proposes a self-evolving prompt optimization method that iteratively improves compilation prompts by learning from LLM self-debugging trajectories. The approach raises correctness from 44% to 64%, with 87.5% of correctly generated programs outperforming clang-O3.
Revisiting Bi-Linear State Transitions in Recurrent Neural Networks: This paper systematically revisits bilinear state transitions in RNNs—i.e., multiplicative interactions between the hidden state and the input—and theoretically proves that bilinear RNNs can simulate arbitrary finite-state machines. By removing additive terms, these models form a natural expressivity hierarchy ranging from diagonal to full-rank structures, revealing that popular linear RNNs such as Mamba occupy the lowest tier of this hierarchy.
SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models: This paper proposes the SAMA framework, the first to jointly model fine-grained spatio-temporal understanding and grounding in multi-turn referential video dialogue. It contributes a unified dataset (SAMA-239K), a model (spatio-temporal context aggregator + SAM), and a benchmark (SAMA-Bench).
Seeing Beyond the Scene: Analyzing and Mitigating Background Bias in Action Recognition: This paper systematically analyzes background bias in action recognition across three model paradigms — classification models, contrastive pre-trained models (CLIP/SigLIP2), and video large language models (VLLMs) — and proposes two mitigation strategies: a dual-branch architecture that fuses segmented human inputs to reduce SBErr by 3.78% for classification models, and automated prompt tuning to reduce SBErr by 9.85% for VLLMs.
Seeing the Arrow of Time in Large Multimodal Models: This paper reveals that current large multimodal models (LMMs) are surprisingly insensitive to the temporal directionality of video (i.e., the Arrow of Time)—producing nearly identical answers for forward and reversed playback. The authors propose ArrowRL, a GRPO-based training strategy that introduces a reverse video reward to elicit temporal direction awareness, and construct AoTBench for evaluation. The approach achieves significant gains across multiple VQA benchmarks, including a 65.9% relative improvement on Vinoground.
SmartWilds: Multimodal Wildlife Monitoring Dataset: This work introduces SmartWilds, the first synchronously collected multimodal wildlife monitoring dataset, integrating three complementary modalities — drone imagery, camera traps, and bioacoustics — comprising 101 GB of data. Cross-modal alignment is achieved via GPS coordinates and timestamps. The dataset establishes a reproducible standard protocol for conservation monitoring, filling the gap in comprehensive multi-sensor fusion benchmarks for ecosystem-scale ecological research.
Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task: This paper proposes the STAR framework, which constructs a video analysis toolbox comprising 22 tools and enables an LLM to alternately invoke temporal and spatial tools to progressively localize a 3D Region of Interest (3D RoI) within videos, achieving improvements of 8.2% on VideoMME and 4.6% on LongVideoBench.
Steering When Necessary: Flexible Steering Large Language Models with Backtracking: This paper proposes FASB (Flexible Activation Steering with Backtracking), a framework that dynamically determines the necessity and intensity of intervention by tracking the internal states of an LLM during generation, and introduces a backtracking mechanism to correct already-deviated tokens. FASB achieves a True*Info score of 80.56% on TruthfulQA and an average accuracy of 78.8% across six multiple-choice tasks, significantly outperforming all baselines.
Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models: This paper proposes PD-SSM, a structured sparse parameterization for the state transition matrix of state-space models (SSMs). The core idea is to factorize the transition matrix as a product of a column-wise one-hot matrix P and a complex diagonal matrix D (i.e., $A = PD$), achieving expressiveness equivalent to unstructured (dense) SSMs while retaining computational efficiency comparable to diagonal SSMs at $\Theta(LN)$. A single layer suffices to simulate any $N$-state finite-state automaton (FSA). The paper provides theoretical guarantees on BIBO stability and optimal state dimensionality, with strong empirical results on FSA simulation, multivariate time-series classification, long-sequence benchmarks, and natural-language state-tracking tasks.
TAPVid-360: Tracking Any Point in 360 from Narrow Field of View Video: This paper introduces the TAPVid-360 task and dataset, requiring models to track the 3D direction of query points (including those outside the field of view) in narrow field-of-view video. By leveraging 360° video to generate training data and fine-tuning CoTracker3 for directional prediction, the proposed approach substantially outperforms existing methods on out-of-field-of-view tracking.
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs: This paper proposes TempSamp-R1, a mixed-policy reinforcement fine-tuning framework that integrates high-quality off-policy (ground truth) guidance into GRPO's on-policy sampling and introduces nonlinear soft advantage estimation to stabilize training, achieving state-of-the-art performance on video temporal grounding (Charades-STA R1@0.7: 52.9%, ActivityNet R1@0.5: 56.0%).
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs: TempSamp-R1 is a reinforcement fine-tuning framework that addresses the inefficiency of on-policy sampling in GRPO for video temporal grounding—caused by the vast search space—by introducing ground-truth annotations as off-policy supervision signals, non-linear soft advantage estimation, and a hybrid CoT training paradigm, achieving new state-of-the-art results on Charades-STA, ActivityNet, and QVHighlights.
The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation: Through a systematic analysis of 52 reasoning benchmarks across three major model families—OpenAI, Anthropic, and Google—this paper identifies an "ouroboros" cycle: old benchmarks are rapidly saturated → new benchmarks are created to restore discriminability → new benchmarks are rapidly saturated in turn. This cycle calls into question whether improvements in benchmark scores genuinely reflect generalized reasoning ability or merely overfit to specific evaluation sets.
TiRex: Zero-Shot Forecasting Across Long and Short Horizons with Enhanced In-Context Learning: This paper proposes TiRex, a pretrained time series forecasting model based on xLSTM. By introducing a Contiguous Patch Masking (CPM) strategy and data augmentation techniques, TiRex with only 35M parameters comprehensively outperforms larger models such as Chronos Bolt (200M) and TimesFM (500M) on the GiftEval and Chronos-ZS benchmarks, achieving state-of-the-art performance in both short- and long-horizon zero-shot forecasting.
Token Bottleneck: One Token to Remember Dynamics: This paper proposes Token Bottleneck (ToBo), a self-supervised visual representation learning pipeline that compresses a reference scene into a single bottleneck token and uses this token together with a minimal number of target scene patches to reconstruct the subsequent scene, thereby training visual backbone networks to simultaneously encode scene information conservatively and capture temporal dynamics.
Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task: This paper proposes a video toolkit comprising 22 tools and the STAR (Spatiotemporal Reasoning) framework, which progressively localizes a 3D Region of Interest (RoI) via an alternating temporal–spatial tool scheduling strategy. The approach improves GPT-4o by 8.2% on VideoMME while substantially reducing the number of processed frames and computational overhead.
Tracking and Understanding Object Transformations: This paper introduces the Track Any State task and the TubeletGraph zero-shot framework, which tracks objects undergoing drastic appearance changes in video (e.g., an apple being cut, a butterfly emerging from a chrysalis) while simultaneously detecting and describing these transformations.
TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels: This paper presents TrackingWorld, a pipeline for dense 3D tracking of almost all pixels from monocular video. It lifts sparse 2D trajectories to dense ones via a tracking upsampler, iteratively tracks newly appearing objects across all frames, and employs an optimization-based framework to lift 2D trajectories into world-coordinate 3D space with explicit decoupling of camera motion and object motion.
Two Causally Related Needles in a Video Haystack: This paper proposes Causal2Needles, a benchmark of 4,100 QA pairs that binds the understanding of two causally related events via a "bridging entity," forcing VLMs to jointly retrieve and reason over two needles scattered across a long video. It reveals severe deficiencies in state-of-the-art models on the causal dual-needle task: ChatGPT-4o achieves only 13.4% "Both" accuracy in this setting.
Unleashing Hour-Scale Video Training for Long Video-Language Understanding: This work constructs VideoMarathon, the first large-scale hour-level video instruction-following dataset (9,700 hours, 3.3M QA pairs, 22 task types), and proposes Hour-LLaVA, a model that leverages a memory repository, forgetting mechanism, and MemAug module to enable efficient training and inference on hour-scale videos at 1 FPS, achieving state-of-the-art results among open-source models of comparable scale across four long video benchmarks.
VGEnt: Graph-Based Retrieval-Reasoning-Augmented Generation for Long Video Understanding: This paper proposes VGEnt, a graph-based retrieval-reasoning-augmented generation framework that constructs a video knowledge graph to preserve cross-segment semantic relationships, and introduces structured reasoning steps to filter noise and aggregate information. VGEnt consistently improves open-source LVLMs by 3.0%–5.4% across multiple long video understanding benchmarks and outperforms existing video RAG methods by 8.6%.
Video Finetuning Improves Reasoning Between Frames: This paper proposes a visual chain-of-thought (vCoT) approach to systematically compare image LLMs and video-finetuned LLMs on inter-frame reasoning. It finds that video finetuning enables models to implicitly learn inter-frame transition reasoning, and that this capability transfers to relational reasoning tasks on static images.
VideoLucy: Deep Memory Backtracking for Long Video Understanding: This paper proposes VideoLucy, a framework that simulates the human coarse-to-fine recall process via a hierarchical memory structure and an agent-based iterative backtracking mechanism. VideoLucy substantially outperforms existing methods on multiple long video understanding benchmarks, surpassing even commercial models such as GPT-4o.
Web-Scale Collection of Video Data for 4D Animal Reconstruction: This paper proposes a fully automated large-scale video data collection pipeline that mines and processes 30K animal videos (2M frames) from YouTube, establishes the first 4D quadruped animal reconstruction benchmark Animal-in-Motion (230 sequences / 11K frames), and introduces a baseline method 4D-Fauna that achieves model-free 4D reconstruction via sequence-level optimization.
When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions: This paper proposes the QV-M2 dataset (the first fully human-annotated multi-moment retrieval benchmark) and the FlashMMR framework (incorporating a Post-Verification Module), extending video moment retrieval from single-moment to multi-moment scenarios and establishing a standardized evaluation protocol for multi-moment retrieval.
When Thinking Drifts: Evidential Grounding for Robust Video Reasoning: This paper systematically identifies the "visual thinking drift" phenomenon in which CoT reasoning frequently degrades performance in video understanding, and proposes the Visual Evidence Reward (VER) reinforcement learning framework that corrects this problem by explicitly rewarding reasoning chains grounded in visual evidence.
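
As an illustrative aside on the PD-SSM entry above (Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models), the following minimal NumPy sketch shows why a transition matrix of the form $A = PD$, with $P$ column-wise one-hot and $D$ complex diagonal, can be applied in $O(N)$ time, and why the product of two such matrices has the same form. The function names (`make_pd`, `pd_matvec`, `pd_compose`) and the index-array encoding of $P$ are assumptions made for this illustration and are not taken from the paper's code.

```python
import numpy as np

def make_pd(idx, d):
    """Build the dense N x N matrix A = P @ D (for checking only).

    idx[j] is the row holding the single 1 in column j of P;
    d is the diagonal of D.
    """
    n = len(idx)
    P = np.zeros((n, n), dtype=complex)
    P[idx, np.arange(n)] = 1.0
    return P @ np.diag(d)

def pd_matvec(idx, d, x):
    """Compute (P @ D) @ x in O(N) via scatter-add, without a dense matrix."""
    out = np.zeros(len(idx), dtype=complex)
    np.add.at(out, idx, d * x)  # entries mapped to the same row accumulate
    return out

def pd_compose(idx1, d1, idx2, d2):
    """Return (idx, d) such that (P1 D1) @ (P2 D2) = P D."""
    # Column j of P2 D2 is d2[j] * e_{idx2[j]}; applying P1 D1 to it
    # gives d1[idx2[j]] * d2[j] * e_{idx1[idx2[j]]}.
    return idx1[idx2], d1[idx2] * d2

rng = np.random.default_rng(0)
n = 6
idx1, idx2 = rng.integers(0, n, n), rng.integers(0, n, n)
d1 = rng.normal(size=n) + 1j * rng.normal(size=n)
d2 = rng.normal(size=n) + 1j * rng.normal(size=n)
x = rng.normal(size=n) + 1j * rng.normal(size=n)

fast = pd_matvec(idx1, d1, x)
dense = make_pd(idx1, d1) @ x
print(np.allclose(fast, dense))
```

Because the composed pair `(idx, d)` again describes a one-hot-times-diagonal matrix, chains of such transitions can be multiplied without ever materializing a dense $N \times N$ matrix.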

🤖 Robotics & Embodied AI

A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning: This work is the first to introduce data attribution into online reinforcement learning. It proposes a local attribution framework to quantify each training record's contribution to policy updates, and builds upon it an Iterative Influence Filtering (IIF) algorithm that substantially improves sample efficiency and final performance on both classical RL benchmarks and LLM RLHF.
Adaptive Frontier Exploration on Graphs with Applications to Network-Based Disease Testing: This paper proposes the Adaptive Frontier Exploration on Graphs (AFEG) framework and designs a Gittins index-based policy that is provably optimal when the graph is a forest. On real-world sexually transmitted disease testing networks, the method identifies nearly all HIV-positive individuals by testing only half the population, substantially outperforming greedy and DQN baselines.
AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling: AutoToM achieves fully automated model-based Theory of Mind inference—without requiring manual agent model specification—by automatically proposing Bayesian network structures and executing Bayesian inverse planning. Through uncertainty-driven iterative model refinement (adding mental variables or extending time steps), it achieves an average accuracy of 82.43% across 5 ToM benchmarks, surpassing SOTA models such as GPT-4o (63.39%) and o3-mini (73.94%).
Beyond Parallelism: Synergistic Computational Graph Effects in Multi-Head Attention: This paper reframes multi-head attention as a system of multiple feedforward DAGs sharing a common sink node, and theoretically demonstrates that multiple heads can achieve synergistic effects through cross-head paths—reducing mixing time and amplifying minimax fidelity—with empirical validation on sequential operation tasks.
Breaking the Gradient Barrier: Unveiling Large Language Models for Strategic Classification: This paper proposes GLIM (Gradient-free Learning In-context Method), which for the first time leverages the In-Context Learning (ICL) mechanism of LLMs to implicitly simulate the bi-level optimization in strategic classification (feature manipulation + decision rule optimization), enabling efficient strategic classification on large-scale data without any fine-tuning.
C-NAV: Towards Self-Evolving Continual Object Navigation in Open World: The paper proposes C-Nav, a framework that employs dual-path anti-forgetting (feature distillation + feature replay) and adaptive experience selection (LOF-based anomaly detection for keyframe selection) to prevent catastrophic forgetting as a navigation agent incrementally learns new object categories, surpassing full data replay baselines across 4 different architectures.
Can Agents Fix Agent Issues?: This paper presents the first systematic study of automated issue resolution in LLM-based agent systems. Through manual analysis of 201 real-world agent issues, the authors construct a taxonomy comprising 6 categories and 20 subcategories, invest approximately 500 person-hours to build AgentIssue-Bench—a benchmark of 50 reproducible tasks—and find that state-of-the-art software engineering (SE) agents (e.g., SWE-agent, Agentless, AutoCodeRover) achieve correct resolution rates of only 3.33%–12.67% on agent issues, far below their 23%–51% rates on conventional software.
CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification: CogVLA proposes a three-stage VLA architecture inspired by human multimodal cognition—comprising EFA-Routing for visual aggregation and compression to 25%, LFP-Routing for instruction-aware pruning of 50% of tokens within the LLM, and V-L-A coupled attention—achieving a 97.4% success rate on LIBERO with 2.5× training and 2.8× inference speedups over SOTA methods such as OpenVLA-OFT, and a 70.0% success rate on real-robot tasks.
C-NAV: Towards Self-Evolving Continual Object Navigation in Open World: This paper proposes C-Nav, a continual object navigation framework that employs a dual-path anti-forgetting mechanism (feature distillation + feature replay) and LOF-based adaptive experience selection to enable navigation agents to incrementally learn new object categories while effectively mitigating catastrophic forgetting. C-Nav surpasses full data replay baselines across 4 mainstream architectures and 2 datasets.
COOPERA: Continual Open-Ended Human-Robot Assistance: This paper proposes the COOPERA framework, the first to enable continual, open-ended human-robot collaboration research. LLM-driven simulated humans with psychological traits and long-term intentions interact with robots over multiple days in a 3D environment. The robot progressively improves its personalized assistance by learning human characteristics and contextual intentions.
DexFlyWheel: A Scalable Self-Improving Data Generation Framework for Dexterous Manipulation: This paper proposes DexFlyWheel, a dexterous manipulation data generation framework that starts from a single human demonstration and progressively scales data diversity through a self-improving loop composed of IL, residual RL, and data augmentation. The framework generates 2,000+ demonstrations across 4 tasks, achieving an average policy success rate of 81.9% and a real-world transfer success rate of 78.3%.
DynaNav: Dynamic Feature and Layer Selection for Efficient Visual Navigation: DynaNav dynamically adjusts feature and layer usage according to scene complexity via a trainable hard feature selector and a Bayesian optimization-based early-exit mechanism, achieving a 2.26× reduction in FLOPs and a 42.3% decrease in inference time in visual navigation while maintaining or improving navigation performance.
EfficientNav: Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval: Through discrete memory caching (group-independent KV cache computation with selective loading), attention-driven clustering (LLM shallow-layer attention guiding grouping), and semantics-aware retrieval (CLIP + knapsack problem adapted to varying memory budgets), EfficientNav is the first system to achieve zero-shot ObjNav on Jetson Orin using LLaMA-3.2-11b, surpassing the GPT-4 baseline by 11.1% SR while reducing real-time latency by 6.7×.
EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT: This paper proposes EgoThinker, which constructs the large-scale egocentric video reasoning dataset EgoRe-5M (with causal CoT annotations and hand-object grounding labels) and adopts a two-stage training paradigm (SFT + GRPO reinforcement fine-tuning) to endow MLLMs with robust egocentric reasoning, hand-object grounding, and temporal localization capabilities, achieving state-of-the-art performance across multiple egocentric benchmarks.
EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT: EgoThinker constructs EgoRe-5M, a 5-million-sample egocentric video QA dataset with causal CoT annotations and fine-grained hand-object localization data. Through a two-stage training paradigm—SFT for reasoning followed by GRPO for grounding—the approach enables a 7B MLLM to simultaneously perform egocentric causal reasoning and spatio-temporal fine-grained localization for the first time, achieving state-of-the-art results on 8+ benchmarks, with the 7B model surpassing 72B models on temporal grounding.
Explaining and Mitigating Crosslingual Tokenizer Inequities: This work systematically trains approximately 7,000 monolingual tokenizers covering 97 languages, providing the first demonstration that significant token premium disparities persist across languages even after controlling for training data size, vocabulary size, and algorithm. It further identifies vocabulary size and pre-tokenization strategy as key contributing factors, and proposes two mitigation approaches: language-specific optimal vocabulary size and SuperBPE.
FALCON: Fine-grained Activation Manipulation by Contrastive Orthogonal Unalignment for Large Language Model: This paper proposes FALCON, a representation-guided LLM unlearning framework that employs mutual information for parameter selection, a contrastive mechanism for fine-grained knowledge separation, and gradient orthogonal projection to resolve forgetting–retention conflicts. FALCON consistently outperforms existing methods on harmful knowledge, copyright, and entity unlearning benchmarks.
Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training: This paper proposes a sim-and-real policy co-training framework based on Unbalanced Optimal Transport (UOT), which aligns the joint observation-action distribution (rather than only the marginal observation distribution), and incorporates a temporally aligned sampling strategy to handle data imbalance, achieving a 30% improvement in OOD generalization on robotic manipulation tasks.
Grasp2Grasp: Vision-Based Dexterous Grasp Translation via Schrödinger Bridges: This paper proposes modeling cross-morphology visual dexterous grasp transfer as a Schrödinger Bridge problem. By learning Score and Flow Matching ([SF]²M) in a latent space and designing physics-aware optimal transport cost functions (over pose, contact maps, grasp wrench space, and Jacobian manipulability), the method achieves distribution-level transfer of grasp intent across different robot hands without requiring paired data.
GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation: This paper proposes GUI-Rise, a framework that jointly designs three subtasks—structured reasoning (progress estimation + decision reasoning), action prediction, and history summarization—combined with GRPO reinforcement learning and a history summarization reward, to significantly improve the cross-domain generalization of GUI navigation agents.
Harnessing the Computation Redundancy in ViTs to Boost Adversarial Transferability: By systematically exploiting data-level and model-level computation redundancy in ViTs, this paper proposes five techniques—attention sparsification, attention head permutation, clean token regularization, Ghost MoE diversification, and robust tokens—combined with an online learning strategy that dynamically selects operations. The method achieves an average fooling rate of 86.9% on ImageNet-1K, substantially outperforming all baselines.
HiMaCon: Discovering Hierarchical Manipulation Concepts from Unlabeled Multi-Modal Data: This paper proposes a self-supervised framework that learns hierarchical manipulation concepts from unlabeled multi-modal robot demonstrations. It organizes representations via a cross-modal correlation network and a multi-horizon future predictor, enhancing the generalization of imitation learning policies to novel objects, unseen obstacles, and new environments.
Knolling Bot: Teaching Robots the Human Notion of Tidiness: This work frames desktop object tidying (knolling) as an NLP-style sequence prediction task, employing a Transformer to autoregressively generate target poses for each object. A Gaussian Mixture Model (GMM) handles solution ambiguity, and the model is trained on 2.4 million automatically generated demonstrations to learn a generalizable notion of tidiness. User preferences are implicitly encoded via the input ordering of objects.
LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents: This paper proposes LabUtopia — a high-fidelity simulation and hierarchical benchmark suite for scientific laboratory environments. It comprises the LabSim simulator with chemical reaction modeling, LabScene for procedural laboratory scene generation, and LabBench, a five-level benchmark spanning atomic operations to long-horizon mobile manipulation. The suite reveals significant bottlenecks in existing imitation learning methods with respect to long-horizon experimental workflows and object generalization.
LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation: This paper proposes LatentGuard, a three-stage framework that combines behavior-level alignment fine-tuning, structured VAE-supervised latent space modeling, and latent-space dimensional manipulation to achieve interpretable and controllable regulation of LLM refusal behavior — robustly defending against adversarial attacks while preserving responsiveness to benign queries.
Learning Spatial-Aware Manipulation Ordering: This paper proposes OrderMind, a unified framework that learns manipulation ordering of objects in cluttered scenes directly from RGB-D images via a Spatial Context Understanding encoder and a Temporal Priority Structuring module. Training annotations are generated through VLM distillation with spatial priors. OrderMind significantly outperforms VLM baselines in both simulation and real-world environments while supporting real-time inference (5.6 FPS, 21.3 FPS for the lightweight variant).
LLM World Models Are Mental: Output Layer Evidence of Brittle World Model Use in LLM Mechanical Reasoning: Drawing on cognitive science methodology for studying mental models, this work evaluates LLM mechanical reasoning ability using TikZ code representations of pulley systems. LLMs can approximately estimate mechanical advantage and distinguish functional from non-functional systems (Studies 1 & 2), but completely fail at fine-grained structural connectivity reasoning (Study 3), indicating that LLM "world models" exist but are brittle.
LLMscape: LLMscape is an interactive, projection-mapped sandscape installation in which multiple independent LLM agents receive multimodal input, converse with one another, and speculate within a shared, mutable physical environment, exploring collaborative sensemaking between humans and AI under cognitive uncertainty.
Manipulating Feature Visualizations with Gradient Slingshots: This paper proposes Gradient Slingshots (GS), a method that "carves" a quadratic activation landscape in the out-of-distribution (OOD) input region of a model, directing the gradient-based optimization of Feature Visualization (FV) toward an arbitrary target image. The approach causes FV to converge to a predefined spurious image while leaving the model's architecture, classification accuracy, and internal feature representations largely intact, thereby exposing a serious vulnerability of FV as a model auditing tool.
MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning: This paper proposes the MesaTask framework, which decomposes task descriptions into a Spatial Reasoning Chain — object reasoning → spatial relationship reasoning → scene graph construction → 3D layout — and combines a 10K+ manually annotated dataset with DPO optimization to generate physically plausible, task-aligned tabletop manipulation scenes.
MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Cultural Learning: MindForge introduces explicit Theory of Mind (ToM) representations, natural language communication, and a multi-component memory system into LLM-driven embodied agents, enabling open-source LLM agents to substantially improve task completion rates through collaborative dialogue with expert agents (without gradient updates), achieving 3× more tech-tree milestones and 2.3× more unique items than Voyager in Minecraft.
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents: MineAnyBuild is a spatial planning benchmark built upon Minecraft, requiring AI agents to generate executable blueprint matrices from multimodal instructions. The benchmark comprises 4,000 tasks and 500+ architectural/decorative assets, and systematically evaluates MLLM spatial planning capabilities across four dimensions: spatial understanding, spatial reasoning, creativity, and spatial commonsense. Results reveal that even GPT-4o achieves only 41.02/100 overall, with open-source models performing substantially worse.
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents: This paper reveals a novel adversarial attack against multimodal OS Agents, termed MIP (Malicious Image Patches): visually imperceptible adversarial perturbation patches (occupying approximately 1/7 of the screen area) are embedded in screenshots, causing the OS Agent to output a predefined sequence of malicious API calls upon capture. Joint optimization enables universal generalization across user instructions and screen layouts, achieving an attack success rate of up to 100%.
MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark: This paper introduces MMTU, a large-scale benchmark comprising 28,136 questions spanning 25 real-world table tasks, designed to systematically evaluate LLMs on professional-level table understanding, reasoning, and manipulation. Even frontier reasoning models such as GPT-5 achieve only approximately 69.6% on this benchmark.
mmWalk: Towards Multi-modal Multi-view Walking Assistance: mmWalk constructs the first multi-modal multi-view dataset for walking assistance targeting blind and low-vision (BLV) individuals (62K frames / 559K panoramic images generated via the CARLA simulator, plus 69K VQA pairs). Benchmarks reveal that state-of-the-art VLMs perform inadequately on safety-critical tasks such as risk assessment and navigation landmark recognition (best accuracy only 55.21%), while fine-tuning yields a 16.7% generalization improvement on real-world datasets.
NeSyPr: Neurosymbolic Proceduralization For Efficient Embodied Reasoning: NeSyPr proposes a neurosymbolic proceduralization framework that transforms task plans generated by symbolic planners into composable procedural representations, enabling compact language models to perform efficient single-step reasoning without relying on external symbolic guidance — analogous to the human process of knowledge compilation.
Operation Veja: Fixing Fundamental Concepts Missing from Modern Roleplaying Training Paradigms: This paper systematically critiques four dominant paradigms in role-playing (RP) model training (RAG, fact-value specification, literary data, and synthetic data), arguing that none can produce characters with genuine depth. It proposes the VEJA framework (Values–Experiences–Judgments–Abilities) as a structured basis for character definition and data curation. In an LLM-judged A/B test, VEJA-guided human-curated data significantly outperforms a Gemini 2.5 Pro synthetic baseline with a win/loss/tie ratio of 43:28:29.
Policy Compatible Skill Incremental Learning via Lazy Learning Interface: This paper proposes SIL-C, a framework that achieves skill-policy compatibility in skill incremental learning via a bilateral lazy learning interface, enabling incrementally updated skills to directly improve downstream policy performance without retraining or structural modification.
Predicting the Performance of Black-Box LLMs through Follow-Up Queries: This paper proposes QueRE, a method that poses approximately 50 follow-up questions to a black-box LLM (e.g., "Are you confident in your answer?") and uses the resulting "Yes" token probabilities as features to train a linear classifier. QueRE achieves strong performance on predicting model correctness, detecting adversarial manipulation, and distinguishing between different LLMs — surpassing even white-box methods that require access to internal model states.
UniDomain: Pretraining a Unified PDDL Domain from Real-World Demonstrations for Generalizable Task Planning: UniDomain pretrains a unified PDDL planning domain—comprising 3,137 operators and 2,875 predicates—from 12,393 real-world robotic manipulation videos. Through hierarchical fusion to construct a meta-domain, it achieves zero-shot cross-task symbolic planning, outperforming the strongest baseline by 58% in success rate and 160% in plan optimality.
RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks: This paper proposes RDD (Retrieval-Based Demonstration Decomposer), which models demonstration decomposition as an optimal partition problem and automatically segments long-horizon task demonstrations into subtasks aligned with the training data of low-level visuomotor policies. This approach bridges the gap between high-level planners and low-level policies in hierarchical VLA frameworks, achieving near-expert-decomposer performance on RLBench.
Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation: This paper proposes EigenShift, a method that applies SVD to the final output projection layer of LLMs to identify semantic directions (eigen-choices) associated with toxic generation, and suppresses toxicity by selectively attenuating the corresponding singular values. On LLaMA-2, EigenShift reduces toxicity by 58% while increasing perplexity by only 3.62, achieving a favorable balance between safety and fluency.
Rethinking the Simulation vs. Rendering Dichotomy: No Free Lunch in Spatial World Modelling: From a cognitive neuroscience perspective, this paper challenges the prevailing view that simulation and rendering are separable processes: it argues that spatial reasoning relies on fine-grained perceptual representations rather than coarse abstractions, and concludes that AI spatial world models likewise require rich perceptual detail — there is no free lunch in spatial modelling.
RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation: This paper proposes RoboCerebra, a long-horizon robotic manipulation benchmark comprising 1,000 human demonstration trajectories (averaging 2,972 steps, approximately 6× longer than existing benchmarks). Through a hierarchical planning and execution framework and a multi-dimensional evaluation protocol, it systematically assesses VLMs across three System 2 cognitive dimensions: planning, reflection, and memory.
SAFE: Multitask Failure Detection for Vision-Language-Action Models: SAFE identifies consistent "failure regions" in the internal feature space of VLA models that generalize across tasks. Leveraging this observation, it trains lightweight MLP/LSTM failure detectors and applies Functional Conformal Prediction (FCP) for threshold calibration. The approach achieves 78% ROC-AUC on unseen tasks with less than 1% computational overhead, substantially outperforming token-uncertainty and action-consistency baselines.
SegMASt3R: Geometry Grounded Segment Matching: SegMASt3R augments the pretrained MASt3R 3D foundation model with a lightweight segmentation feature head and a differentiable Sinkhorn matching layer. By leveraging 3D geometric priors, it achieves robust semantic segment matching under extreme viewpoint changes (up to 180°), attaining an AUPRC of 83.6% on the 135–180° baseline (vs. 17% for SAM2).
Spatial Understanding from Videos: Structured Prompts Meet Simulation Data: This paper proposes a two-pronged approach combining the SpatialMind structured prompting strategy and the ScanForgeQA synthetic QA dataset to substantially enhance VLMs' ability to perform 3D spatial reasoning from scanned videos, without modifying the underlying model architecture.
SutureBot: A Precision Framework & Benchmark for Autonomous End-to-End Suturing: This paper presents SutureBot — the first precision-oriented benchmark and goal-conditioned framework for end-to-end autonomous suturing on the da Vinci surgical robot. It releases a high-fidelity dataset of 1,890 demonstrations, achieves 59%–74% improvements in needle insertion accuracy via point-label goal conditioning, and systematically evaluates state-of-the-art VLA models including π0, GR00T N1, OpenVLA-OFT, and multi-task ACT.
Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras: Talk2Event introduces the first large-scale visual grounding benchmark for event cameras (30,690 annotated referring expressions across four grounding attribute types), and proposes the EventRefer framework, which employs a Mixture of Event-Attribute Experts (MoEE) to dynamically fuse appearance, status, viewer-relation, and inter-object-relation features. EventRefer surpasses existing methods across all three evaluation settings: event-only, frame-only, and fusion.
Task-Optimized Convolutional Recurrent Networks Align with Tactile Processing in the Rodent Brain: This paper proposes the Encoder-Attender-Decoder (EAD) framework to systematically explore task-optimized temporal neural networks for tactile processing. It finds that convolutional recurrent networks (ConvRNNs, especially IntersectionRNN) outperform feedforward and state-space models on both tactile object classification and neural alignment with rodent somatosensory cortex. Contrastive self-supervised learning with tactile-specific augmentations achieves neural fitting comparable to supervised learning, providing the first quantitative characterization of the brain's computational mechanisms for touch.
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning: ThinkAct proposes a dual-system framework that applies action-aligned visual rewards to fine-tune MLLMs via reinforcement learning, eliciting embodied reasoning capabilities and compressing reasoning plans into visual latent representations to guide a downstream action model—realizing a "think before act" VLA reasoning paradigm.
Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs: This paper introduces EngDesign—the first LLM engineering design benchmark spanning 9 engineering domains (operating systems, computer architecture, control systems, mechanical engineering, structural engineering, digital hardware, analog circuits, robotics, and signal processing)—replacing conventional QA matching with a simulation-driven evaluation pipeline. The benchmark reveals that even the most capable reasoning model, o3, achieves only a 34% pass rate.
Towards Reliable Code-as-Policies: A Neuro-Symbolic Framework for Embodied Task Planning: This paper proposes a neuro-symbolic framework for embodied task planning that augments LLM-based code generation with explicit symbolic verification (checking whether preconditions are satisfied) and interactive verification (active exploration to acquire missing information), enabling more reliable code execution in dynamic and partially observable environments. On RLBench, task success rate improves from a baseline of 38.5% to 84.7%, with executability reaching 86.8%.
Uncovering Strategic Egoism Behaviors in Large Language Models: This paper presents the first formal definition of Strategic Egoism (SE) in LLMs and introduces SEBench, a benchmark comprising 160 scenarios across 6 SE dimensions. Experiments on 7 mainstream LLMs show that, under incentive-driven conditions, an average of 69.11% of decisions favor self-interested strategies. Manipulation/coercion and rule circumvention are the most prevalent tactics, and SE tendency is positively correlated with toxic language generation.
Understanding Prompt Tuning and In-Context Learning via Meta-Learning: This paper systematically analyzes the theoretical foundations and limitations of prompt tuning from a Bayesian meta-learning perspective. It proves that soft prompts can achieve optimal adaptation on a single target task within the pretraining distribution, yet face fundamental limitations under multi-task mixture target distributions. Furthermore, soft prefixes can surpass the optimal hard-token sequence by manipulating activations outside the token space.
VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions: This paper proposes the CLIP-IN framework, which leverages instruction editing datasets as hard negatives and incorporates long captions to enhance CLIP's fine-grained visual understanding. The approach achieves significant improvements on benchmarks such as MMVP without compromising zero-shot performance, and when integrated into MLLMs, it substantially reduces visual hallucinations.
Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs: This paper proposes ZEDD (Zero-shot Embedding Drift Detection), which detects prompt injection attacks by measuring semantic drift between benign and suspicious inputs in the embedding space. It leverages GMM/KDE to automatically determine detection thresholds, achieving >93% detection accuracy with <3% false positive rate across multiple LLM architectures.
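
The drift check summarized in the ZEDD entry above can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: drift is measured here as cosine distance, the embeddings are toy 3-dimensional vectors rather than real model outputs, and a fixed, hypothetical constant stands in for the GMM/KDE-derived threshold. The names `cosine_drift`, `flag_injection`, and `HYPOTHETICAL_THRESHOLD` are illustrative only.

```python
import math

def cosine_drift(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)

def flag_injection(clean_emb, suspect_emb, threshold):
    """Flag the suspect input when its drift from the clean input exceeds the threshold."""
    return cosine_drift(clean_emb, suspect_emb) > threshold

# Hypothetical fixed cutoff; the summarized method instead derives
# thresholds automatically via GMM/KDE.
HYPOTHETICAL_THRESHOLD = 0.2

# Toy embeddings (assumed values, not real model outputs).
clean = [1.0, 0.0, 0.0]
near = [0.9, 0.1, 0.0]   # small semantic shift
far = [0.0, 1.0, 0.0]    # orthogonal direction, large shift

print(flag_injection(clean, near, HYPOTHETICAL_THRESHOLD))
print(flag_injection(clean, far, HYPOTHETICAL_THRESHOLD))
```

In practice the two inputs would be embedded with the same encoder, and the cutoff would be fit on a distribution of drift scores rather than hard-coded.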

⚖️ Alignment & RLHF¶

A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs: This paper proposes an Adaptive Alpha aggregation strategy that dynamically adjusts reward weights based on each user group's historical alignment performance within a federated RLHF framework, simultaneously achieving high fairness and strong alignment performance for pluralistic preference alignment.
Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency: This paper proposes JAIL-CON, a jailbreak attack framework based on task concurrency. By interleaving harmful and benign tasks at the word level, it exploits LLMs' ability to handle concurrent tasks to bypass safety mechanisms, while the resulting concurrent outputs exhibit stronger evasiveness against guardrails.
Alignment of Large Language Models with Constrained Learning: This paper proposes CAID (Constrained Alignment via Iterative Dualization), an iterative dualization method that alternately updates the LLM policy and dual variables. It theoretically establishes that the dual approach can identify the optimal constrained LLM policy (up to a parametrization gap), and empirically demonstrates significant improvements in constraint satisfaction and the helpfulness–safety trade-off on the PKU-SafeRLHF dataset.
Ask a Strong LLM Judge when Your Reward Model is Uncertain: This paper proposes an uncertainty-based routing framework that applies SNGP to a pairwise reward model for uncertainty quantification, routing high-epistemic-uncertainty samples to a strong LLM judge (DeepSeek-R1). At a judge invocation cost of only 9.2%–42.5%, the approach significantly outperforms random routing in accuracy and demonstrably improves downstream online RLHF alignment.
Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs: This paper proposes a two-stage fine-tuning attack: Stage 1 fine-tunes an LLM on 10 benign questions paired with identical refusal answers, driving the model to overfit into a sharp loss landscape; Stage 2 fine-tunes the same 10 questions with normal answers, triggering catastrophic forgetting of safety alignment. Using entirely benign data, the method achieves a 94.84% attack success rate (ASR), comparable to malicious fine-tuning (97.25%), while completely evading content moderation.
Can DPO Learn Diverse Human Values? A Theoretical Scaling Law: This paper establishes a theoretical generalization framework for DPO under diverse human value settings. By analyzing the dynamic trajectory of reward margins after a finite number of gradient steps, it proves that the number of samples required per value must grow logarithmically with the number of value categories $K$ (i.e., $Q = \Theta(\log K)$) to maintain generalization performance, thereby revealing the statistical cost of aligning with diverse societal values.
Capturing Individual Human Preferences with Reward Features: This paper proposes the Reward Feature Model (RFM), which learns shared reward features $\phi_\theta(x,y)$ such that each user obtains a personalized reward $r_h = \langle \phi_\theta, \mathbf{w}_h \rangle$ via a linear weight vector $\mathbf{w}_h$. The work provides the first PAC generalization bound for multi-annotator preference learning, proving that increasing the number of annotators $m$ is more effective than increasing per-annotator sample count $n$, and that as few as 30 samples suffice for fast adaptation to new users.
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO: This paper proposes DeepVideo-R1, which reformulates GRPO as Reg-GRPO that directly regresses advantage values (eliminating clipping/min safeguards), and mitigates the vanishing advantage problem via difficulty-aware data augmentation, achieving an improvement of up to 10.1 percentage points over standard GRPO on video reasoning tasks.
DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models: This paper identifies the motion bias problem in video DPO and addresses it through three measures: (1) constructing structurally aligned video pairs by noising and denoising ground-truth videos, which fixes the motion dimension; (2) annotating dense preferences at the temporal segment level for more precise learning signals; and (3) leveraging off-the-shelf VLMs for automatic annotation to reduce cost. Using only 1/3 of the annotation data, the method substantially improves motion generation quality while matching visual quality and text alignment.
Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization: This paper proposes the Latent Reward Model (LRM) and Latent Preference Optimization (LPO), which repurpose the pretrained diffusion model itself as a noise-aware latent-space reward model to perform step-level preference optimization directly in the noisy latent space. Compared to Diffusion-DPO, LPO achieves a 10–28× training speedup; compared to SPO, it achieves a 2.5–3.5× speedup.
DP²O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution: This paper proposes DP²O-SR, a framework that exploits the inherent stochasticity of diffusion models to generate diverse super-resolution outputs, constructs preference pairs via a hybrid perceptual reward, and introduces a Hierarchical Preference Optimization (HPO) strategy to adaptively weight training pairs — significantly improving perceptual quality in real-world image super-resolution without any human annotations.
EvoRefuse: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions: This paper proposes EvoRefuse, a framework that employs evolutionary search (mutation/recombination + ELBO fitness function + simulated annealing) to automatically generate semantically benign yet reliably refusal-triggering "pseudo-malicious" instructions. The resulting EvoRefuse-Test benchmark achieves an 85.34% higher refusal trigger rate and 34.86% greater lexical diversity than the strongest baseline, while the EvoRefuse-Align dataset reduces over-refusal by 29.85%–45.96% via SFT/DPO fine-tuning without compromising safety.
From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring: This paper proposes the Streaming Content Monitor (SCM)—the first harmful content monitor natively designed for partial detection. Built upon the FineHarm dataset (29K samples with token-level annotations) and hierarchical consistency-aware learning, SCM achieves a macro F1 of 0.95+ after observing on average only 18% of response tokens, enabling real-time early stopping of harmful LLM outputs.
g-DPO: Scalable Preference Optimization for Protein Language Models: To address the quadratic growth of preference pairs with respect to sample size when applying DPO to protein language models (PLMs)—which renders training intractable—this paper proposes g-DPO: (1) redundant preference pairs are pruned via union-mask-based clustering in sequence space, retaining more informative comparisons within local neighborhoods; (2) grouped likelihood amortization via shared union masks enables computation of log-likelihoods for all sequences within a group in a single forward pass. Across three protein engineering tasks, g-DPO achieves statistically indistinguishable in silico and in vitro performance compared to standard DPO, while delivering 1.7–5.4× training speedups.
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs: This paper proposes GASP, a framework that trains a dedicated SuffixLLM to generate human-readable adversarial suffixes. It employs Latent Bayesian Optimization (LBO) to efficiently search the continuous embedding space and iteratively fine-tunes the generator via ORPO, achieving high attack success rates in a fully black-box setting while maintaining suffix readability.
Generalizing while Preserving Monotonicity in Comparison-based Preference Learning Models: This paper proposes Linear GBT with Diffusion Prior, a class of preference learning models that simultaneously guarantee monotonicity (the score of the preferred item never paradoxically decreases after a comparison) and generalization to uncompared items, thereby affirmatively answering the central question of whether generalization and monotonicity can coexist.
Greedy Sampling Is Provably Efficient for RLHF: This paper proves that, under KL-regularized RLHF, directly applying greedy sampling based on empirical estimates—without constructing optimistic or pessimistic confidence sets—achieves $O(\log T)$ regret in the online setting and $O(\varepsilon^{-1})$ sample complexity in the offline setting. These are the first results of such order under general preference models.
GVPO: Group Variance Policy Optimization for Large Language Model Post-Training: GVPO is a more stable LLM post-training method than GRPO, derived by embedding the analytical solution of KL-constrained reward maximization into gradient weights (zero-sum weights eliminate the partition function). It achieves 20.72% on AIME (vs. GRPO's 14.79%) and is proven to possess a unique global optimum.
Human-assisted Robotic Policy Refinement via Action Preference Optimization: This paper proposes Action Preference Optimization (APO), a human-robot collaboration framework that collects interactive trajectories and applies preference alignment to VLA models using binary desirability signals grounded in prospect theory and an adaptive reweighting scheme, enabling the model to learn from failures and improve iteratively.
Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay: Two complementary techniques are proposed to improve the data efficiency of LLM reinforcement fine-tuning (GRPO): (1) DOTS — an attention-based mechanism for predicting adaptive difficulty that prioritizes moderate-difficulty questions to maximize gradient signal; and (2) Rollout Replay — reusing recent rollouts to reduce per-step computational overhead. Together, these techniques reduce training time by an average of 40.7% across 6 model–dataset combinations.
Inference-time Alignment in Continuous Space: This paper proposes Simple Energy Adaptation (SEA), which shifts the inference-time alignment paradigm from discrete-space search to continuous-space optimization. By performing gradient-based Langevin sampling over the continuous logit space, SEA approximates the optimal RLHF policy, achieving a 77.51% relative improvement over the strongest baseline on AdvBench and a 16.36% improvement on MATH.
Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models: This paper proposes a policy-based (rather than example-based) evaluation framework for LLM red teaming, along with the Jailbreak-Zero method. By employing a simple large-scale parallel sampling strategy—requiring no manually crafted jailbreak tactics—the method achieves attack success rates of 99.5% on GPT-4o and 96.0% on Claude 3.5 on HarmBench, while attaining Pareto optimality across three objectives—coverage, diversity, and fidelity—through fine-tuning.
KL Penalty Control via Perturbation for Direct Preference Optimization: This paper proposes ε-DPO, which achieves instance-level adaptive KL penalty control by monitoring the monotonicity of logits—used as preference model outputs—under small perturbations of $\beta$ during training. The method incurs no additional computational overhead and significantly outperforms DPO and most direct alignment algorithms, achieving a 46.4% LC win rate on AlpacaEval 2 (vs. 40.3% for DPO).
LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits: This work frames the selection of multiple reward models (RMs) as a contextual multi-armed bandit (LinUCB) problem, adaptively choosing the most suitable RM for each training batch during iterative LLM training. LASeR comprehensively outperforms RM ensemble and single-RM baselines on reasoning, instruction-following, and long-context tasks, while achieving a 2–3× efficiency advantage.
Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis: This paper proposes LENS, a framework that synthesizes preference data pairs in the latent space of LLM embeddings via a VAE, bypassing costly text generation and achieving substantial improvements in reward model performance at dramatically reduced computational cost (16,000× smaller model, 18× faster generation).
LLM Safety Alignment is Divergence Estimation in Disguise: This paper establishes a unified theoretical framework demonstrating that alignment methods such as RLHF, DPO, KTO, and BCO are essentially estimating the divergence between a safe distribution $\mathcal{D}^+$ and an unsafe distribution $\mathcal{D}^-$. This perspective explains the latent-space separation phenomenon observed after alignment. Building on this insight, the paper proposes KLDO, a KL divergence-based alignment method that achieves state-of-the-art robustness across 5 models.
LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization: LongVPO proposes a two-stage DPO framework. Stage 1 constructs pseudo-long-video preference data by anchoring short clips and introduces an anchor-only reference model approximation to address context-length mismatch. Stage 2 performs self-training on real long videos via recursive captioning and multi-clip reasoning tasks. Using only 16K synthetic samples, the method surpasses long-video models trained with large-scale supervised data.
Mechanism Design for LLM Fine-tuning with Multiple Reward Models: This paper formulates multi-party preference aggregation in RLHF fine-tuning as a mechanism design problem. It proves that under social-welfare-maximizing training rules, participants have incentives to misreport their preferences, and achieves dominant-strategy incentive compatibility (DSIC) via an extended VCG payment mechanism that ensures truthful reporting.
MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation: This paper proposes MetaDefense, a two-stage (pre-generation + mid-generation) defense framework that trains the LLM itself to predict the harmfulness of queries and partial responses, defending against finetuning-based jailbreak attacks without external classifiers, achieving 2× memory efficiency.
Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization: This paper proposes SymMPO (Symmetric Multimodal Preference Optimization), which addresses two key limitations of existing vision-augmented DPO methods—namely, theoretically unsound objective functions and indirect preference supervision—through symmetric paired preference learning over contrastive images and preference margin consistency regularization. Consistent performance gains are achieved across five hallucination benchmarks.
Multi-Environment POMDPs: Discrete Model Uncertainty Under Partial Observability: This paper systematically studies Multi-Environment POMDPs (ME-POMDPs)—a class of POMDP ensembles sharing state, action, and observation spaces but with arbitrarily different transition, observation, and reward functions—with the goal of finding a robust policy that maximizes reward under the worst-case environment. By introducing the Adversarial Belief POMDP (AB-POMDP) as a unified model and establishing its equivalence to one-sided partially observable stochastic games (POSGs), the paper proposes both exact (value iteration + LP) and approximate (AB-HSVI) algorithms.
On Extending Direct Preference Optimization to Accommodate Ties: This paper replaces the Bradley-Terry preference model in DPO with the Rao-Kupper and Davidson extensions, enabling preference optimization to explicitly model "tie" data. This avoids discarding ambiguous preference pairs and yields improved regularization and performance on translation and mathematical reasoning tasks.
ORPO-Distill: Mixed-Policy Preference Optimization for Cross-Architecture LLM Distillation: This paper proposes ORPO-Distill, which reformulates cross-architecture LLM knowledge distillation as a preference optimization problem. The teacher model generates positive reasoning chains while the student model generates negative ones; an ORPO contrastive loss is used for training, augmented by a mixed-policy update strategy for student negative samples. The method consistently outperforms black-box KD baselines across 5 QA benchmarks.
PolyJuice Makes It Real: Black-Box, Universal Red Teaming for Synthetic Image Detectors: This paper proposes PolyJuice, the first black-box, image-agnostic red teaming method for synthetic image detectors (SIDs). By discovering and exploiting a "realism direction" in the latent space of text-to-image (T2I) models, PolyJuice universally steers generated images to fool detectors, achieving an attack success rate of up to 84%.
Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma: This paper formalizes the recurring safety–fairness–efficiency tensions in RLHF as an "alignment trilemma": it proves that no RLHF system can simultaneously satisfy $\varepsilon$-representativeness (faithfully reflecting diverse values), polynomial tractability (computational feasibility), and $\delta$-robustness (resistance to adversarial attacks), thereby providing a unified complexity-theoretic explanation for pathological phenomena such as preference collapse and sycophancy observed in current RLHF systems.
Preference Learning with Lie Detectors can Induce Honesty or Evasion: This paper systematically investigates the effects of integrating lie detectors into the LLM preference learning annotation pipeline (the SOLiD framework), finding that whether a trained model becomes genuinely honest or learns to evade detection depends on three key factors: the degree of exploration (GRPO vs. DPO), detector accuracy (TPR), and KL regularization strength.
Preference Optimization by Estimating the Ratio of the Data Distribution: This paper reinterprets DPO as a likelihood ratio (ratio matching) estimation problem and proposes BPO (Bregman Preference Optimization) under a Bregman divergence framework. BPO defines a generalized family of loss functions that subsumes DPO as a special case, and introduces the SBA (Scaled Basu's Power Divergence) instantiation, achieving a state-of-the-art 55.9% AlpacaEval2 length-controlled win rate on Llama-3-8B.
Provably Efficient Online RLHF with One-Pass Reward Modeling: This paper proposes a one-pass reward modeling method based on online mirror descent (OMD) that eliminates the computational bottleneck in online RLHF — namely, storing all historical data and re-optimizing from scratch at each iteration — achieving $\mathcal{O}(1)$ time and memory complexity per iteration while also improving upon MLE methods in statistical efficiency.
Reinforcement Learning Finetunes Small Subnetworks in Large Language Models: RL fine-tuning of LLMs updates only 5%–30% of parameters in practice (sparse subnetworks), and these subnetworks exhibit high consistency across different random seeds, datasets, and algorithms. Fine-tuning only the identified subnetwork can reproduce both the performance and the parameter values of full fine-tuning.
ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning: This paper proposes ResponseRank, a method that robustly learns utility differences by exploiting local relative differences in proxy signals of preference strength (e.g., response time and annotator agreement), significantly improving the sample efficiency of reward models.
Rethinking Direct Preference Optimization in Diffusion Models: To address two core issues in DPO for diffusion models — limited exploration and reward scale imbalance — this paper proposes a stable reference model update strategy and a timestep-aware training strategy, both of which can be integrated into various preference optimization algorithms.
Robust LLM Alignment via Distributionally Robust Direct Preference Optimization: This paper proposes two robust DPO variants—WDPO (Wasserstein) and KLDPO (KL divergence)—under a distributionally robust optimization (DRO) framework to address alignment failures caused by shifts in user preference distributions. The approach provides $O(n^{-1/4})$ convergence guarantees and achieves significant improvements over standard DPO on multi-dimensional alignment tasks and the OpenLLM leaderboard.
SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism: By analyzing the propagation mechanism of harmful tokens in multimodal LLMs, this work finds that fewer than 1% of tokens trigger jailbreak behavior in early-to-middle layers. Based on this finding, the training-free SafePTR framework is proposed, which prunes harmful tokens at vulnerable layers and restores benign features in subsequent layers, significantly improving safety without sacrificing task performance.
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning: This work is the first to systematically apply the Constrained Markov Decision Process (CMDP) framework from Safe Reinforcement Learning (SafeRL) to safety alignment of Vision-Language-Action (VLA) models. Through a four-stage Integrated Safety Approach (ISA)—Model, Elicit, Constrain, and Assure—the method achieves an 83.58% reduction in safety violation costs on mobile manipulation tasks while maintaining task performance (+3.85%).
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization: This paper proposes RRPO (Refined Regularized Preference Optimization), which replaces DPO's response-level rewards with subsequence-level fine-grained rewards and token-wise KL regularization. Combined with a self-alignment data generation framework, RRPO reduces hallucinations and improves temporal reasoning on video understanding tasks.
Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: This paper theoretically proves and empirically validates that defending against suffix jailbreak attacks of length $\Theta(M)$ requires adversarial training on suffixes of only length $\Theta(\sqrt{M})$—i.e., "short adversarial training defends against long jailbreaks." Across five mainstream LLMs, adversarial training with 20-token suffixes reduces the attack success rate (ASR) of 120-token jailbreak attacks by at least 30%.
Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning: This paper identifies that reference model bias in NPO (Negative Preference Optimization) causes two problems: optimization power is allocated unevenly across forget data, and gradient weight smoothing is ineffective in the early stage of training. The proposed SimNPO removes the dependence on a reference model and adopts length-normalized rewards, improving FQ from 0.79 to 0.99 on TOFU and consistently outperforming NPO across all benchmarks.
Strategyproof Reinforcement Learning from Human Feedback: This paper is the first to study strategic manipulation by annotators in RLHF from a mechanism design perspective. It proves a fundamental tradeoff between strategyproofness and policy alignment, and proposes the Pessimistic Median of MLEs algorithm to achieve approximate strategyproofness.
T-SHIRT: Token-Selective Hierarchical Data Selection for Instruction Tuning: This paper proposes T-SHIRT, a data selection framework that introduces Selective IFD (considering only informative tokens) and a hierarchical selection strategy (preferring samples with high neighborhood consistency). Fine-tuning on only 5% of data selected by T-SHIRT surpasses training on the full dataset, while the selection process requires only GPT-2 and 40 minutes on a single GPU.
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons: Through a mechanistic interpretability lens, this work identifies a sparse set of "safety neurons" comprising approximately 5% of all neurons in LLMs. Patching only these neurons' activations recovers over 90% of safety performance, and the neuron-overlap perspective offers a mechanistic explanation for the alignment tax phenomenon.
Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training: This paper proposes TBA (Trajectory Balance with Asynchrony), which combines the GFlowNet Trajectory Balance (TB) objective with an asynchronous distributed RL architecture to decouple exploration and learning in LLM post-training, achieving 4–50× speedups without performance degradation across mathematical reasoning, preference fine-tuning, and automated red-teaming tasks.
Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning: TBRM minimizes trajectory-level Bellman residuals by treating LLM output logits as implicit Q-values, requiring only a single forward rollout per prompt during training. This yields substantially lower complexity than PPO/GRPO while achieving comparable or superior performance on mathematical reasoning benchmarks.
What Makes a Reward Model a Good Teacher? An Optimization Perspective: From an optimization-theoretic perspective, this paper proves that reward model accuracy alone is insufficient to measure its quality as an RLHF "teacher." Even a perfectly accurate reward model can lead to a flat RLHF objective landscape and extremely slow policy gradient optimization if the induced reward variance is too low. Moreover, different language models require different reward models.
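The tie-aware preference models named in the "On Extending Direct Preference Optimization to Accommodate Ties" entry above can be illustrated numerically. The following is a minimal sketch of the textbook Rao-Kupper and Davidson outcome probabilities, written in terms of a reward margin `d`. The function names, the parameter name `nu`, and the way the margin is passed in are illustrative assumptions; this is not the paper's code or its exact training loss.

```python
import math
from typing import Tuple


def sigmoid(x: float) -> float:
    """Logistic function used by the Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-x))


def rao_kupper(d: float, nu: float) -> Tuple[float, float, float]:
    """Return (P(win), P(tie), P(loss)) for a reward margin d = r_a - r_b.

    nu >= 1 controls how much probability mass goes to ties;
    nu == 1 recovers Bradley-Terry (no ties).
    """
    if nu < 1.0:
        raise ValueError("Rao-Kupper requires nu >= 1")
    shift = math.log(nu)
    p_win = sigmoid(d - shift)    # equals pi_a / (pi_a + nu * pi_b)
    p_loss = sigmoid(-d - shift)  # equals pi_b / (pi_b + nu * pi_a)
    return p_win, 1.0 - p_win - p_loss, p_loss


def davidson(d: float, nu: float) -> Tuple[float, float, float]:
    """Return (P(win), P(tie), P(loss)) under the Davidson model.

    nu >= 0 scales the tie mass; nu == 0 recovers Bradley-Terry.
    """
    if nu < 0.0:
        raise ValueError("Davidson requires nu >= 0")
    a = math.exp(d / 2.0)
    b = math.exp(-d / 2.0)
    z = a + b + nu  # normaliser shared by all three outcomes
    return a / z, nu / z, b / z


if __name__ == "__main__":
    # Hypothetical margin and tie parameters, chosen only for illustration.
    print(rao_kupper(0.5, 2.0))
    print(davidson(0.5, 1.0))
```

At zero margin, `rao_kupper(0.0, 2.0)` and `davidson(0.0, 1.0)` both assign roughly one third to each outcome, which shows how the tie parameter reserves probability mass that Bradley-Terry would otherwise split between win and loss.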

🕸️ Graph Learning¶

Agint: Agentic Graph Compilation for Software Engineering Agents: This paper proposes Agint, a graph compiler that progressively compiles natural language intent into typed DAGs through a six-level type floor (TEXT→TYPED→SPEC→STUB→SHIM→PURE), paired with a hybrid JIT runtime and a Unix-style toolchain. This transforms AI code generation from brittle single-pass text prediction into a structured, parallelizable, and reproducible compilation process.
BLISS: Bandit Layer Importance Sampling Strategy for Efficient Training of Graph Neural Networks: This paper proposes BLISS, which formulates layer-wise neighbor sampling in GNNs as a multi-armed bandit problem. Using the EXP3 algorithm, it dynamically adjusts per-edge sampling probabilities with the variance contribution of neighbors to node representations as the reward signal, achieving accuracy on par with or exceeding full-batch training on GCN and GAT.
Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs: This paper proposes DP (Deliberation on Priors), a framework that leverages structural priors from knowledge graphs via progressive knowledge distillation to generate faithful relational paths, and validates reasoning reliability through a reasoning introspection strategy based on constraint priors, achieving new state-of-the-art performance on KGQA benchmarks.
Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking: A systematic audit of 16 KGQA datasets reveals an average factual correctness of only 57% (WebQSP: 52%, MetaQA: 20%). The paper proposes KGQAGen, a framework that constructs high-quality multi-hop QA datasets via LLM-guided subgraph expansion and automatic SPARQL validation, yielding KGQAGen-10k with 96.3% accuracy. The study further demonstrates that the primary bottleneck in KG-RAG lies in retrieval rather than reasoning.
DuetGraph: Coarse-to-Fine Knowledge Graph Reasoning with Dual-Pathway Global-Local Fusion: DuetGraph proposes a dual-pathway (message passing + global attention) parallel fusion model and a coarse-to-fine reasoning optimization strategy. By separating rather than stacking local/global information processing, it mitigates score over-smoothing in KG reasoning, achieving SOTA on both inductive and transductive tasks with up to 8.7% MRR improvement and 1.8× training speedup.
Dynamic Bundling with Large Language Models for Zero-Shot Inference on Text-Attributed Graphs: DENSE proposes a "text bundling" strategy that packages textually and topologically/semantically similar nodes into bundles, queries LLMs for bundle-level labels, supervises GNN training via entropy-based and ranking-based losses, and dynamically refines bundles to exclude noisy nodes. It achieves comprehensive zero-shot inference improvements over GPT-4o and graph foundation models across 10 TAG datasets.
Elastic Weight Consolidation for Knowledge Graph Continual Learning: An Empirical Evaluation: This paper systematically evaluates Elastic Weight Consolidation (EWC) for continual learning of TransE knowledge graph embeddings on FB15k-237, finding that EWC reduces catastrophic forgetting from 12.62% to 6.85% (a 45.7% reduction), and reveals that task partitioning strategy (relation-based vs. random) has a substantial impact on forgetting metrics (a difference of 9.8 percentage points).
ESCA: Contextualizing Embodied Agents via Scene-Graph Generation: This paper proposes the ESCA framework, which provides structured visual understanding context for MLLM-driven embodied agents via open-vocabulary scene graph generation (the SGClip model), substantially reducing perception error rates and improving task completion rates.
FALCON: An ML Framework for Fully Automated Layout-Constrained Analog Circuit Design: FALCON proposes an end-to-end framework for automated analog/RF circuit design via a three-stage pipeline: MLP-based topology selection, edge-centric GNN performance prediction, and differentiable layout-constrained gradient inference. Trained on a million-scale Cadence simulation dataset, the framework achieves >99% topology selection accuracy, <10% performance prediction error, and sub-second per-instance inference.
FastJAM: a Fast Joint Alignment Model for Images: FastJAM is a graph-based fast joint image alignment method that computes pairwise keypoint correspondences using off-the-shelf image matchers, constructs a keypoint graph via fast non-parametric clustering, employs a GNN to propagate and aggregate information for predicting per-image homography parameters, and adopts an inverse-compositional loss to eliminate the need for regularization hyperparameters. It reduces joint alignment time from hours/minutes to approximately 49 seconds while achieving alignment quality superior to or on par with existing methods.
From Sequence to Structure: Uncovering Substructure Reasoning in Transformers: This paper presents empirical and theoretical analyses revealing how decoder-only Transformers understand graph structure from text sequences. It proposes "Induced Substructure Filtration" (ISF) to explain the layer-wise substructure identification mechanism, and extends this framework to validate consistency in LLMs, support compositional graph reasoning (Thinking-in-Substructures), and enable substructure extraction in attributed graphs (molecular graphs).
Generative Graph Pattern Machine: This paper proposes the Generative Graph Pattern Machine (G2PM), a fully message-passing-free generative Transformer framework for graph pre-training. It tokenizes graph instances (nodes/edges/graphs) into substructure sequences via random walks and performs self-supervised pre-training under a Masked Substructure Modeling objective. G2PM comprehensively outperforms existing graph pre-training methods on node/link/graph classification and cross-domain transfer tasks, while exhibiting model and data scaling laws analogous to those observed in NLP and CV.
Geometric Imbalance in Semi-Supervised Node Classification: This work formally introduces the concept of "geometric imbalance" in semi-supervised node classification, showing that message passing on class-imbalanced graphs causes minority-class nodes to exhibit geometric ambiguity in Riemannian manifold embedding spaces. It proposes the UNREAL framework to address this issue via three modules: dual-path pseudo-label alignment, node reordering, and geometric imbalance sample discarding.
GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation: This paper proposes GFM-RAG, the first graph foundation model-driven retrieval-augmented generation framework, which performs single-pass multi-hop reasoning over knowledge graphs via a query-dependent GNN. With only 8M parameters, GFM-RAG achieves zero-shot generalization to unseen datasets and substantially outperforms state-of-the-art methods on multi-hop QA retrieval benchmarks.
GnnXemplar: Exemplars to Explanations -- Natural Language Rules for Global GNN Interpretability: This paper proposes GnnXemplar, a framework grounded in the cognitive-science Exemplar Theory. It selects representative nodes (exemplars) in the GNN embedding space and employs an LLM with iterative self-refinement to generate natural-language Boolean rules, achieving global interpretability for node-classification GNNs on large-scale graphs.
Graph Neural Networks for Efficient AC Power Flow Prediction in Power Grids: This work models power networks as graph structures (buses as nodes, transmission lines as edges) and investigates four GNN architectures — GCN, GAT, SAGEConv, and GraphConv — for predicting AC power flow solutions (voltage magnitudes and phase angles). Experiments on IEEE 14/30/57/118-bus test systems demonstrate that GNNs can efficiently substitute traditional Newton-Raphson solvers.
Graph Neural Networks for Interferometer Simulations: This work presents the first application of graph neural networks to optical interferometer simulation, employing a GATv2 + KAN architecture to predict electromagnetic field power and spatial intensity distributions within LIGO interferometers. The approach achieves inference speeds up to 815× faster than the standard simulation software (FINESSE) while maintaining satisfactory physical accuracy.
Graph Persistence goes Spectral: This paper proposes SpectRe, a novel topological descriptor for graphs that incorporates graph Laplacian spectral information into persistent homology (PH). It proves that SpectRe is strictly more expressive than either PH or spectral methods alone, establishes a local stability theory, and demonstrates improved GNN graph classification performance on both synthetic and real-world datasets.
GraphFaaS: Serverless GNN Inference for Burst-Resilient, Real-Time Intrusion Detection: This paper proposes GraphFaaS, a serverless inference architecture for GNN-based intrusion detection. Through incremental provenance graph construction, feature-length-aware parallel node embedding, and greedy best-fit subgraph partitioning, GraphFaaS reduces mean detection latency from 14.16 seconds to 2.1 seconds (6.7×) and the coefficient of variation from 1.46 to 0.52 (64% reduction), maintaining stable low latency under bursty workloads without sacrificing detection accuracy.
GraphTOP: Graph Topology-Oriented Prompting for Graph Neural Networks: This paper proposes GraphTOP, the first graph topology-oriented prompting framework, which formulates topology-oriented prompting as an edge rewiring problem and relaxes it into a continuous space via Gumbel-Softmax. GraphTOP outperforms six baseline methods across five datasets and four pre-training strategies.
Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems: This paper proposes the Heterogeneous Swarms algorithm, which models multi-LLM systems as directed acyclic graphs (DAGs) and employs particle swarm optimization (PSO) to jointly optimize model roles (graph topology) and model weights, achieving an average improvement of 18.5% over 17 baselines across 12 tasks.
Interaction-Centric Knowledge Infusion and Transfer for Open-Vocabulary Scene Graph Generation: This paper proposes ACC, an interaction-centric framework that addresses the critical matching problem in open-vocabulary scene graph generation (OVSGG) by shifting from the conventional object-centric paradigm to an interaction-driven one. During the knowledge infusion stage, bidirectional interaction prompts are used to generate more accurate pseudo supervision; during the knowledge transfer stage, interaction-guided query selection and interaction-consistency knowledge distillation reduce mismatches. ACC achieves state-of-the-art performance on three benchmarks: VG, GQA, and PSG.
Learning Repetition-Invariant Representations for Polymer Informatics: This paper proposes GRIN (Graph Repetition-Invariant Network), which achieves invariance to the number of repeated monomer units in polymer representations via Max aggregation and a specialized graph construction strategy, addressing a fundamental symmetry problem in polymer representation learning.
Logical Expressiveness of Graph Neural Networks with Hierarchical Node Individualization: This paper proposes Hierarchical Ego GNNs (HEGNNs), which generalize subgraph GNNs through a hierarchical node individualization mechanism, forming a hierarchy of models with strictly increasing expressive power. On bounded-degree graphs, the paper proves that the distinguishing power of HEGNN node classifiers is equivalent to graded hybrid logic ($\mathcal{HL}_k$), thereby unifying the expressiveness analysis of various GNN variants.
Making Classic GNNs Strong Baselines Across Varying Homophily: A Smoothness-Generalization Perspective: This paper theoretically reveals the smoothness-generalization dilemma inherent in GNN message passing, and proposes the IGNN framework with three minimal design principles — separative neighborhood transformation, inceptive aggregation, and neighborhood relationship learning — to systematically alleviate this dilemma. IGNN achieves top performance among 30 baselines and demonstrates universality across both homophilic and heterophilic graphs.
Moscat: Mixture of Scope Experts at Test for Generalizing Deeper GNNs: Grounded in PAC-Bayes generalization theory, this paper proves that varying GNN depth induces generalization preference drift across node subgroups with different homophily levels. It proposes Moscat—a post-processing attention-gating model that adaptively fuses independently trained GNN experts of different depths at test time on a per-node basis—achieving significant improvements across diverse GNN architectures and datasets.
MoEMeta: Mixture-of-Experts Meta Learning for Few-Shot Relational Learning: This paper proposes MoEMeta, a framework that employs a Mixture-of-Experts model to learn globally shared relational prototypes for cross-task generalization, combined with a task-customized projection adaptation mechanism to capture local context, achieving state-of-the-art performance on three KG benchmarks.
Nonlinear Laplacians: Tunable Principal Component Analysis under Directional Prior Information: This paper proposes a nonlinear Laplacian spectral algorithm that fuses spectral information with directional prior information by adding a diagonal matrix—obtained by applying a nonlinear function $\sigma$ to the degree vector of the observation matrix $\bm{Y}$—to $\hat{\bm{Y}}$. The approach significantly reduces the signal detection threshold in the biased sparse PCA problem (from $\beta^*=1$ to approximately $0.76$).
OCN: Effectively Utilizing Higher-Order Common Neighbors for Better Link Prediction: This paper identifies redundancy and over-smoothing issues in higher-order common neighbors (CN) for link prediction, and proposes orthogonalization (Gram-Schmidt to remove inter-order linear dependence) combined with normalization (dividing by path count, a generalized resource allocation heuristic) as a solution. The method achieves an average improvement of 7.7% in HR@100 across 7 datasets, with a 13.3% gain on the DDI dataset.
Over-squashing in Spatiotemporal Graph Neural Networks: This paper provides the first formal treatment of over-squashing in spatiotemporal graph neural networks (STGNNs), uncovering a counterintuitive "temporal sink" phenomenon in causal convolutions—whereby the earliest timestep exerts the greatest influence on the final representation—and proves that time-and-space (T&S) and time-then-space (TTS) architectures are equivalent in terms of information bottlenecks, offering theoretical justification for the computationally efficient TTS design.
P-DRUM: Post-hoc Descriptor-based Residual Uncertainty Modeling for Machine Learning Potentials: This paper proposes P-DRUM, a simple and efficient post-hoc uncertainty quantification framework that leverages descriptors from a trained graph neural network potential to estimate prediction residuals as uncertainty proxies, requiring no modification to the original model architecture or training pipeline.
Practical Bayes-Optimal Membership Inference Attacks: This paper proposes BASE and G-BASE, two practical Bayes-optimal membership inference attack methods targeting i.i.d. data and graph-structured data, respectively, achieving theoretical optimality while substantially reducing computational cost.
PKD: Preference-driven Knowledge Distillation for Few-shot Node Classification: PKD is a framework that jointly leverages LLMs and multiple GNN teachers for few-shot node classification on text-attributed graphs. A GNN-preference node selector (GNS) uses KL divergence-based uncertainty to identify nodes requiring LLM annotation, while a node-preference GNN selector (NGS) employs RL to match each node with its optimal GNN teacher. PKD achieves consistent state-of-the-art performance across 9 datasets (e.g., Cornell 87% vs. baselines 59–82%).
Principled Data Augmentation for Learning to Solve Quadratic Programming Problems: This paper proposes a principled data augmentation framework based on affine transformations of the KKT system, generating optimality-preserving augmented instances for MPNN-based learning-to-optimize (L2O) on linear programming (LP) and quadratic programming (QP) tasks. Combined with contrastive pretraining, the framework substantially improves performance in data-scarce and out-of-distribution (OOD) generalization settings.
Reasoning Meets Representation: Envisioning Neuro-Symbolic Wireless Foundation Models: This paper proposes a vision framework that integrates the neuro-symbolic (NeSy) paradigm with Wireless Physical-layer Foundation Models (WPFMs)—employing WPFMs as a neural perception engine to generate RF embedding vectors, and an ontology-driven knowledge graph together with a differentiable logic layer as the symbolic reasoning component. The resulting system achieves interpretable, generalizable, and compliance-verifiable wireless AI, providing a concrete technical pathway toward AI-native 6G networks.
Relieving the Over-Aggregating Effect in Graph Transformers: This paper identifies the over-aggregating phenomenon in Graph Transformers—wherein a large number of nodes are aggregated with near-uniform attention scores, diluting critical information—and proposes Wideformer, which alleviates this issue via divided aggregation and guided attention. As a plug-and-play module, Wideformer consistently improves backbone model performance across 13 datasets.
ReMindRAG: Low-Cost LLM-Guided Knowledge Graph Traversal for Efficient RAG: This paper proposes ReMindRAG, a KG-RAG system that combines LLM-guided KG traversal (node exploration + exploitation) with a training-free memory replay mechanism. The system stores LLM traversal experience in edge embeddings, enabling a significant reduction in LLM calls for subsequent similar queries (~50% cost reduction) while improving answer accuracy (5%–10% gain).
Self-Supervised Discovery of Neural Circuits in Spatially Patterned Neural Responses with Graph Neural Networks: A GNN-based self-supervised framework is proposed that infers latent synaptic connectivity via a structure learning module while simultaneously predicting future spiking activity via a spike prediction module. The approach substantially outperforms statistical inference baselines on both simulated ring attractor network data and real mouse head-direction cell recordings.
Sketch-Augmented Features Improve Learning Long-Range Dependencies in Graph Neural Networks: This paper proposes Sketched Random Features (SRF), which injects random kernel-space projections of node features into every layer of a standard message-passing GNN, simultaneously alleviating oversquashing, oversmoothing, and limited expressiveness, with rigorous theoretical guarantees and low computational overhead.
S'MoRE: Structural Mixture of Residual Experts for Parameter-Efficient LLM Fine-tuning: This paper proposes S'MoRE, a framework that organizes low-rank residual experts into a multi-layer tree structure and constructs token-specific "residual trees" via hierarchical routing, achieving exponentially growing structural flexibility with parameter counts comparable to LoRA, thereby substantially improving LLM fine-tuning performance.
Solar-GECO: Perovskite Solar Cell Property Prediction with Geometric-Aware Co-Attention: This paper proposes Solar-GECO, a multimodal framework that encodes the 3D crystal structure of the perovskite absorber layer via a geometric GNN and the remaining device layers via LLM text embeddings, fuses them through a co-attention module, and predicts power conversion efficiency (PCE) along with its uncertainty, reducing MAE from 3.066 to 2.936.
Spatio-Temporal Directed Graph Learning for Account Takeover Fraud Detection: This paper proposes ATLAS, a framework that reformulates account takeover (ATO) fraud detection as a node classification problem on spatio-temporal directed graphs. By constructing causal directed graphs via temporal windows and nearest-neighbor constraints, and combining lag-aware label propagation with a GraphSAGE encoder, ATLAS achieves a +6.38% AUC improvement and over 50% reduction in user friction on a production graph at Capital One with 100M nodes and 1B edges.
SPOT-Trip: Dual-Preference Driven Out-of-Town Trip Recommendation: This paper proposes SPOT-Trip, the first framework to systematically address out-of-town trip recommendation. By integrating knowledge graph-enhanced static preference learning, neural ODE-driven dynamic preference learning, and a preference fusion module, the framework achieves up to 17.01% improvement over state-of-the-art baselines on two real-world datasets.
SSTAG: Structure-Aware Self-Supervised Learning Method for Text-Attributed Graphs: This paper proposes SSTAG, which jointly distills complementary knowledge from LLMs and GNNs into a structure-aware MLP via dual knowledge distillation, and incorporates a memory bank mechanism to store prototype representations, enabling efficient and scalable cross-domain self-supervised pre-training on text-attributed graphs.
TAMI: Taming Heterogeneity in Temporal Interactions for Temporal Graph Link Prediction: This paper is the first to systematically identify the heterogeneity problem in temporal graph interactions (interaction intervals follow a power-law distribution), and proposes the TAMI framework comprising two modules—Log Time Encoding (LTE) and Link History Aggregation (LHA)—that can be seamlessly integrated into existing TGNNs, consistently improving link prediction performance across 16 datasets with gains of up to 87.05%.
The Underappreciated Power of Vision Models for Graph Structural Understanding: This paper reveals the severely underappreciated capability of vision models (ResNet/ViT/Swin, etc.) for graph structural understanding. By rendering graphs as images and processing them with visual encoders, these models significantly outperform GNNs in global topology perception and cross-scale generalization. The paper also introduces the GraphAbstract benchmark to systematically evaluate this finding.
Uncertain Knowledge Graph Completion via Semi-Supervised Confidence Distribution Learning: ssCDL converts triple confidence scores from scalars into Gaussian confidence distributions to capture supervisory signals from neighboring confidence values, and employs meta self-training to generate high-quality pseudo confidence labels for negatively sampled triples, thereby rebalancing the training data. The method significantly outperforms all baselines on both confidence prediction and link prediction for uncertain knowledge graph completion.
Unifying and Enhancing Graph Transformers via a Hierarchical Mask Framework: This paper proposes a unified hierarchical mask framework that reveals the equivalence between Graph Transformer architectures and attention masks, and introduces M3Dphormer, which achieves efficient adaptive modeling of local/cluster/global interactions via multi-level masks, bi-level expert routing, and a dual attention computation scheme, achieving state-of-the-art results on 9 benchmarks.
Unifying Text Semantics and Graph Structures for Temporal Text-attributed Graphs with LLMs: This paper proposes the Cross framework, which employs LLMs to dynamically summarize the semantic evolution of node neighborhoods at strategically sampled temporal points (Temporal Reasoning Chain), then bidirectionally fuses text semantics and graph structural temporal information via a semantic-structural co-encoder. The approach achieves an average MRR improvement of 24.7% on temporal link prediction and a 3.7% AUC gain on an industrial dataset (WeChat).
Wavy Transformer: This paper establishes a formal equivalence between Transformer attention layers and graph neural diffusion on complete graphs, and proposes the Wavy Transformer based on second-order wave equations. By exploiting energy conservation properties, the method mitigates over-smoothing in deep Transformers and achieves consistent improvements across NLP, CV, and sparse graph tasks.
What Expressivity Theory Misses: Message Passing Complexity for GNNs: This paper critiques the binary expressivity theory of GNNs for its inability to explain practical performance differences, and proposes MPC—a continuous, task-specific complexity measure grounded in probabilistic lossyWL—achieving a Spearman correlation of -1 with accuracy (versus ρ = 0 for classical WLC), and successfully explaining why GCN with virtual nodes outperforms higher-expressivity higher-order models on long-range tasks.
When No Paths Lead to Rome: Benchmarking Systematic Neural Relational Reasoning: This paper introduces the NoRA benchmark, which systematically breaks the assumption underlying existing relational reasoning benchmarks that "reasoning can be reduced to path composition." By introducing off-path reasoning, ambiguous facts, and multi-relational settings, it reveals fundamental deficiencies in all existing models—including o3—on off-path reasoning.
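As a rough illustration of the Log Time Encoding idea mentioned in the TAMI entry above, the sketch below log-compresses an interaction interval before applying a cosine encoding, so that power-law-distributed gaps are spread more evenly. The function name `log_time_encoding`, the fixed frequency schedule, and the use of `log1p` are assumptions made for illustration only; they are not taken from the paper, where the encoding may be learned and parameterized differently.

```python
import math


def log_time_encoding(delta_t, dim=8):
    """Encode a non-negative time gap as cosines of its log-compressed value.

    Hypothetical sketch: the frequencies are fixed and geometric here,
    whereas a real temporal GNN would typically learn them.
    """
    if delta_t < 0:
        raise ValueError("delta_t must be non-negative")
    # Log compression: large gaps are squeezed, small gaps stay distinguishable.
    z = math.log1p(delta_t)
    # Geometric frequency schedule from 1.0 down to roughly 0.1 (assumption).
    freqs = [1.0 / (10.0 ** (i / dim)) for i in range(dim)]
    return [math.cos(f * z) for f in freqs]


if __name__ == "__main__":
    short_gap = log_time_encoding(1.0)
    long_gap = log_time_encoding(1_000_000.0)
    print(len(short_gap), short_gap[0], long_gap[0])
```

Under this sketch, a 9-second difference between two short gaps changes the encoding noticeably, while the same difference between two very long gaps barely changes it, which matches the intuition of treating heavy-tailed intervals on a log scale.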

💬 LLM / NLP

AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play: This paper proposes AceSearcher—a collaborative self-play framework in which a single LLM simultaneously plays two roles: a decomposer (breaking complex queries into sub-questions to guide retrieval) and a solver (integrating retrieved context to generate answers). Through a two-stage training pipeline of SFT followed by iterative DPO, using only final-answer rewards, AceSearcher achieves an average EM improvement of 7.6% across 10 datasets, and the 32B model matches DeepSeek-V3 with fewer than 5% of its parameters.
Adaptive Kernel Design for Bayesian Optimization Is a Piece of CAKE with LLMs: This paper proposes CAKE (Context-Aware Kernel Evolution), which leverages LLMs as crossover and mutation operators within a genetic algorithm framework to adaptively generate and evolve GP kernel expressions during Bayesian optimization. Combined with the BAKER ranking mechanism that balances model fit (BIC) and expected improvement (EI), CAKE consistently outperforms both fixed-kernel and adaptive-kernel baselines on tasks including hyperparameter optimization, controller tuning, and photonic chip design.
Are Language Models Efficient Reasoners? A Perspective from Logic Programming: This paper proposes a framework for evaluating LLM reasoning efficiency (rather than correctness alone) from a logic programming perspective. By mapping natural language proofs to logic program proofs via verbalized logic programs, the authors find that current LLMs not only suffer accuracy degradation on math problems containing irrelevant axioms, but also exhibit severely inefficient reasoning—more than half of all reasoning steps are unnecessary.
AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise: AutoDiscovery proposes Bayesian Surprise as an objective reward signal for open-ended scientific discovery — estimating the KL divergence between prior and posterior belief distributions via LLM sampling, combined with MCTS and progressive widening to explore the hypothesis space. On 21 real-world datasets, the method produces 5–29% more surprising discoveries than greedy/beam search baselines. Human evaluation confirms that Bayesian Surprise aligns with expert "surprise" ratings (0.67), substantially outperforming LLM self-evaluated "novelty" and "usefulness."
C²Prompt: Class-aware Client Knowledge Interaction for Federated Continual Learning: To address class-level knowledge inconsistency during prompt communication in federated continual learning, this paper proposes C²Prompt, which explicitly enhances class-level knowledge coherence across clients via two mechanisms: Local Class Distribution Compensation (LCDC) and Class-aware Prompt Aggregation (CPA). The method achieves an average accuracy of 87.20% on ImageNet-R, surpassing the previous SOTA Powder by 2.51%.
CAT: Circular-Convolutional Attention for Sub-Quadratic Transformers: CAT replaces the $N \times N$ attention matrix in standard self-attention with a circulant matrix generated from an $N$-dimensional vector, leveraging FFT to achieve $O(N \log N)$ attention computation. While strictly preserving the softmax row-normalization structure, CAT matches or surpasses standard attention on ImageNet-1k (avg pool, CLIP-L accuracy 0.694 vs. 0.646) and WikiText-103 masked LM (PPL 8.32 vs. 9.82).
Characterizing the Expressivity of Fixed-Precision Transformer Language Models: This work precisely characterizes the expressive power of fixed-precision, strictly causal, soft-attention, NoPE Transformers — showing it is exactly equivalent to linear temporal logic restricted to past operators, LTL[P] — and unifies this characterization with partially ordered deterministic finite automata (PODFA) and $\mathcal{R}$-trivial monoids.
Composing Linear Layers from Irreducibles: By leveraging Clifford algebra, this work represents linear layers as compositions of bivectors—specifically as rotor sandwich products—requiring only $O(\log^2 d)$ parameters to replace a $d \times d$ dense matrix. When applied to Q/K/V projections in LLM attention layers, performance closely matches the original model and strong baselines.
Cultural Alien Sampler: Open-ended Art Generation Balancing Originality and Coherence: This paper proposes the Cultural Alien Sampler (CAS), which employs two GPT-2 models to separately model "concept coherence" and "cultural typicality," selecting concept combinations with high coherence but low cultural typicality to generate original yet harmonious artistic ideas. In human evaluations, CAS approaches the level of art students and substantially outperforms GPT-4o.
Detecting High-Stakes Interactions with Activation Probes: Linear activation probes (lightweight classifiers trained on LLM internal representations) are used to detect "high-stakes interactions" from users. Trained on synthetic data, these probes achieve AUROC of 0.88–0.92 across 6 real-world datasets, matching fine-tuned 8–12B LLMs at a computational cost six orders of magnitude lower. A cascaded architecture (probe pre-filtering + LLM refinement) further surpasses either component used alone.
Don't Be Lazy: CompleteP Enables Compute-Efficient Deep Transformers: CompleteP parameterization (α=1) is the only scheme that simultaneously achieves hyperparameter transfer along the depth dimension and complete feature learning, saving 12–34% FLOPs over μP on deep models.
EnCompass: Enhancing Agent Programming with Search Over Program Execution Paths: This paper proposes the Probabilistic Angelic Nondeterminism (PAN) programming model and the EnCompass Python framework, which decouple an agent's core workflow logic from its inference-time search strategy. Programmers only need to insert branchpoint() markers at LLM call sites and switch among best-of-N, beam search, tree search, and other strategies via a few configuration parameters, reducing the amount of code modification by 3–6×.
EvoRefuse: Evaluating and Mitigating LLM Over-Refusal via Evolutionary Prompt Optimization: This paper proposes EvoRefuse, a framework that employs evolutionary search to maximize the ELBO for automatically generating diverse pseudo-malicious instructions, yielding a more challenging over-refusal evaluation benchmark (EvoRefuse-Test) and an effective alignment mitigation dataset (EvoRefuse-Align).
GeoCAD: Local Geometry-Controllable CAD Generation with Large Language Models: This paper proposes GeoCAD, the first method for locally geometry-controllable CAD generation. It introduces a complementary captioning strategy to generate geometric instructions for local parts and fine-tunes an LLM to enable precise modification of local CAD components according to user-defined text instructions.
Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales: This paper investigates the hyperparameter scaling rules for matrix-preconditioned optimizers (Shampoo/SOAP/Muon) with respect to model width and depth under the μP framework, and demonstrates that correct hyperparameter scaling is the key to achieving consistent speedups. Using μP with $1/\text{width}$ weight decay, all three optimizers consistently achieve approximately $1.4\times$ speedup on Llama models ranging from 190M to 1.4B parameters.
In-Context Learning of Linear Dynamical Systems with Transformers: Approximation Bounds and Depth-Separation: This paper analyzes the ICL approximation capability of linear Transformers on noisy linear dynamical systems: $O(\log T)$ depth suffices to achieve $O(\log T / T)$ test error (approaching the least-squares estimator), while single-layer linear Transformers admit an irreducible lower bound — revealing a depth-separation phenomenon under non-IID data.
Large Language Models Miss the Multi-Agent Mark: This position paper systematically surveys 1,400+ papers to argue that current LLM-based multi-agent systems (MAS LLMs) deviate from foundational MAS theory along four dimensions: LLMs lack native social behavior, environment design is LLM-centric, asynchronous coordination and standard communication protocols are absent, and emergent behaviors lack quantification. The paper warns that the field risks reinventing the wheel while ignoring 40 years of MAS research.
Linear Transformers Implicitly Discover Unified Numerical Algorithms: After training linear Transformers on a masked block matrix completion task, algebraic analysis of the learned weights reveals that the models implicitly converge to the same two-line iterative update rule—EAGLE—under three fundamentally different computational constraints (centralized, distributed, and compute-limited). This rule achieves second-order convergence with only logarithmic dependence on the condition number.
MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention: This paper proposes MonarchAttention, which leverages the structured properties of Monarch matrices and employs alternating optimization over a variational form of softmax to approximate attention at $\Theta(N\sqrt{N}d)$ complexity. The method enables zero-shot replacement of attention layers in pretrained Transformers without any additional training, while achieving 1.4×–8.2× speedups over FlashAttention-2 on GPU.
MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery: This work formalizes fine-grained scientific hypothesis generation as a combinatorial optimization problem and proposes Hierarchical Heuristic Search (HHS)—using LLM pairwise comparisons as gradient signals to navigate the hypothesis space, with hierarchical abstraction smoothing the reward landscape to reduce local optima entrapment. On an expert-annotated benchmark of 51 post-2024 chemistry papers, Soft Recall improves from 19.99% to 40.35%.
msf-CNN: Patch-based Multi-Stage Fusion with Convolutional Neural Networks for TinyML: This paper proposes msf-CNN, a multi-stage patch-based fusion optimization technique based on a directed acyclic graph (DAG) shortest-path algorithm. By efficiently searching for the optimal fusion configuration of CNNs, msf-CNN achieves 50%–87% reduction in peak RAM usage compared to existing methods (MCUNetV2, StreamNet) across various microcontrollers (ARM Cortex-M, RISC-V, ESP32), while maintaining controllable computational overhead.
Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models: Nemotron-Flash constructs a latency-optimal family of small language models through systematic optimization of depth-width ratios, evolutionary search over hybrid operator combinations (DeltaNet + Mamba2 + Attention), and weight-normalization-based training. Compared to Qwen3-1.7B/0.6B, it achieves 1.3×/1.9× latency reduction alongside a +5.5% average accuracy improvement.
On the Role of Hidden States of Modern Hopfield Network in Transformer: This paper moves beyond the adiabatic approximation underlying the established correspondence between Modern Hopfield Networks (MHN) and Transformers. By retaining the hidden-state dynamics of MHN, it derives a novel attention mechanism, Modern Hopfield Attention (MHA), that propagates attention scores across layers within self-attention. MHA systematically improves the performance of ViT and GPT-2 without adding any trainable parameters, and the paper shows both theoretically and empirically that it alleviates the rank collapse problem in deep Transformers.
Opinion Maximization in Social Networks by Modifying Internal Opinions: This paper studies the optimization problem of maximizing the overall opinion in a social network by modifying the internal opinions of $k$ key nodes. Two sampling-based approximation algorithms (random walk and forest sampling) and one exact asynchronous algorithm MIS are proposed. MIS provides theoretical convergence guarantees to the optimal solution and demonstrates superior efficiency and accuracy on real-world networks with tens of millions of nodes.
Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL: This paper proposes PNLC, a method that trains a lightweight goal-conditioned value function as a "natural language critic" to guide LLM agents in multi-turn planning and self-refinement at the thought-step level. Without direct fine-tuning or inference-time search, PNLC significantly outperforms existing methods on complex interactive tasks such as web navigation, social reasoning, and persuasion, while achieving 8–10× faster inference.
PluralisticBehaviorSuite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies: This paper introduces PBSuite, an evaluation suite comprising 300 industry-specific behavioral policies and a dynamic multi-turn adversarial evaluation framework. It reveals that mainstream LLMs exhibit high compliance under single-turn settings (violation rate <4%), but compliance degrades sharply under multi-turn adversarial interactions (violation rate up to 84%).
Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity: This paper reveals a "polarity shift" phenomenon in LLM inference sparsity — MLP layer sparsity vanishes as batch size increases, while attention head sparsity remains stable and batch-invariant. Based on this insight, the authors design Selective Head Attention and corresponding GPU kernels, achieving up to 2.2× end-to-end speedup in large-batch inference.
Post Hoc Regression Refinement via Pairwise Rankings: This paper proposes RankRefine, a model-agnostic post-processing regression refinement method that fuses predictions from a base regressor with estimates derived from pairwise rankings via inverse-variance weighting. Without any retraining, the method achieves up to 10% relative MAE reduction in molecular property prediction using only 20 pairwise comparisons and a general-purpose LLM.
PRESTO: Preimage-Informed Instruction Optimization for Prompting Black-Box LLMs: This paper proposes PRESTO, a framework that exploits the many-to-one mapping (preimage structure) from soft prompts to instructions in white-box LLMs. Through three components (score sharing, preimage-based initialization, and score consistency regularization), PRESTO effectively obtains 14× as much labeled data under the same query budget, significantly improving instruction optimization efficiency for black-box LLMs.
Q♯: Provably Optimal Distributional RL for LLM Post-Training: This paper proposes Q♯, a distributional RL-based value function method for KL-regularized LLM post-training. By learning the cumulative reward distribution under the reference policy to compute the optimal soft Q-function for guided generation, Q♯ achieves higher accuracy and lower KL divergence on mathematical reasoning tasks, and provides a variance-dependent PAC convergence bound.
Reparameterized LLM Training via Orthogonal Equivalence Transformation: This paper proposes POET, a training framework that reparameterizes weight matrices as the product of two learnable orthogonal matrices and a fixed random weight matrix, thereby preserving spectral properties throughout training to achieve more stable optimization and improved generalization with fewer trainable parameters than standard training with AdamW.
Scaling Up Active Testing to Large Language Models: By introducing three key simplifications—constructing a fixed surrogate model via in-context learning, using a small surrogate model to evaluate a large target model, and eliminating the need for target model predictions during data acquisition—this work scales active testing to LLMs, reducing risk estimation error by 25%–80% relative to random sampling.
SolverLLM: Solving Optimization Problems via Test-Time Scaling with LLM-Guided Search: This paper proposes SolverLLM, a training-free framework that treats the mathematical modeling of optimization problems as a search problem. It employs an enhanced MCTS to explore optimal formulations within a six-element representation space, incorporating dynamic expansion, prompt backpropagation, and uncertainty backpropagation. SolverLLM surpasses both prompting-based and fine-tuning-based methods on 6 benchmarks without any training.
Solving Inequality Proofs with Large Language Models: This paper proposes IneqMath, the first large-scale olympiad-level inequality benchmark, formulates inequality proving as two automatically verifiable subtasks (bound estimation and relation prediction), develops a five-module LLM-as-Judge framework, and finds that even o1 achieves an overall accuracy below 10% under step-by-step reasoning scrutiny.
SPACE: Noise Contrastive Estimation Stabilizes Self-Play Fine-Tuning for Large Language Models: This paper proposes SPACE (Self-PlAy via Noise Contrastive Estimation), which incorporates noise contrastive estimation into self-play fine-tuning. By independently optimizing the absolute reward values of real and synthetic samples—rather than their relative margin—SPACE fundamentally resolves the unstable convergence issues of methods such as SPIN, and provides provable convergence guarantees.
Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning: This paper proposes Sparse MeZO (S-MeZO), motivated by the observation that zeroth-order gradient noise disproportionately affects parameters with large magnitudes. S-MeZO selectively applies zeroth-order perturbation and updates only to small-magnitude parameters, achieving significant performance gains (+9% on RTE) and convergence acceleration (3.5×) without any additional memory overhead.
Spectral Conditioning of Attention Improves Transformer Performance: This paper theoretically establishes that the condition number of the attention layer Jacobian in Transformers is governed by the condition numbers of the Query/Key/Value matrices, and proposes Spectral Conditioned Attention — a plug-and-play module that reduces the condition number by adding fixed correction terms to the Q/K/V matrices, consistently improving performance across image classification, object detection, and NLP tasks.
SubSpec: Speculate Deep and Accurate — Lossless and Training-Free Acceleration for Offloaded LLMs: This paper proposes SubSpec, a plug-and-play lossless and training-free acceleration method for offloaded LLMs. The core idea is to construct a highly aligned quantized substitute draft model directly from the offloaded target model itself, and to maximize alignment by sharing GPU-resident layers and KV-Cache. SubSpec achieves a 9.1× speedup for Qwen2.5 7B under an 8GB VRAM budget and a 12.5× speedup for Qwen2.5 32B under 24GB VRAM.
Strassen Attention, Split VC Dimension and Compositionality in Transformers: This paper introduces the Splitting VC dimension as a theoretical tool to prove fundamental limitations of single-layer softmax Transformers (even with infinite precision) on compositional reasoning tasks, and proposes the Strassen attention mechanism with sub-cubic time complexity to overcome these limitations.
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Model: StreamBridge proposes a simple and generalizable framework that enables multi-turn streaming interaction via a memory buffer with round-decayed compression, and achieves proactive response through a decoupled lightweight activation model. Combined with the purpose-built Stream-IT dataset, it successfully converts offline Video-LLMs (e.g., Qwen2-VL, LLaVA-OV) into streaming assistants, surpassing GPT-4o and Gemini 1.5 Pro on OVO-Bench and Streaming-Bench.
SYMPHONY: Synergistic Multi-agent Planning with Heterogeneous Language Model Assemblies: This paper proposes SYMPHONY, an MCTS-based multi-agent planning framework that leverages diversity-driven search over a heterogeneous LLM pool, UCB-based adaptive scheduling, entropy-modulated confidence scoring, and pool-level memory sharing to substantially improve planning diversity and efficiency.
Synergy over Discrepancy: A Partition-Based Approach to Multi-Domain LLM Fine-Tuning: This paper proposes a partition-based multi-stage fine-tuning framework that strategically partitions multiple domains into subsets (stages) to maximize inter-domain synergy while minimizing negative transfer, and derives a novel generalization bound to theoretically support the partitioning strategy.
System Prompt Optimization with Meta-Learning: This paper formulates system prompt optimization as a bilevel problem and proposes MetaSPO, a meta-learning framework that optimizes system prompts for cross-task generalization in the outer loop while optimizing task-specific user prompts in the inner loop. The resulting system prompts significantly outperform baselines across 14 unseen tasks.
Systematizing LLM Persona Design: A Four-Quadrant Technical Taxonomy for AI Companions: This paper proposes a four-quadrant technical taxonomy for LLM persona design, organized along two axes—"virtual vs. embodied" and "emotional companionship vs. functional augmentation"—to systematically analyze the technology stacks, core challenges, and ethical risks across diverse scenarios ranging from virtual companions and game NPCs to caregiving robots.
The Rise of Parameter Specialization for Knowledge Storage in Large Language Models: This paper systematically analyzes 20 open-source LLMs and finds that stronger models exhibit higher degrees of parameter specialization in MLP value vectors — i.e., semantically similar knowledge tends to be concentrated in a small subset of parameter vectors. Causal experiments further confirm a causal relationship between this specialization degree and model performance on knowledge tasks.
Triplets Better Than Pairs: Towards Stable and Effective Self-Play Fine-Tuning for LLMs: This paper proposes T-SPIN (Triplet Self-Play Fine-Tuning), which extends SPIN by introducing a "historical advantage" (proto-synthetic responses as anchor points) and an entropy constraint to enable reference-free policy training. T-SPIN addresses two core issues in SPIN: optimization instability and train-generation misalignment, achieving performance comparable to full-data SFT using only 25% of labeled data.
Unifying Attention Heads and Task Vectors via Hidden State Geometry in In-Context Learning: This paper proposes a unified framework based on hidden state geometry (separability + alignment) that bridges the two major explanatory lines of ICL — attention heads (PTH/IH) and task vectors — revealing a two-phase mechanism in classification tasks: early layers establish separability via PTH, while later layers improve alignment with label unembedding directions via IH.
Valid Inference with Imperfect Synthetic Data: A hyperparameter-free framework based on Generalized Method of Moments (GMM) is proposed to integrate imperfect LLM-generated synthetic data with real data for statistically valid inference. When the residuals of synthetic data are correlated with those of real data, the framework can substantially reduce estimation variance, while guaranteeing no harm to estimation quality in the worst case (i.e., when synthetic data is entirely uninformative).
Weak-to-Strong Generalization under Distribution Shifts: This paper demonstrates that naive weak-to-strong generalization fails under distribution shifts—where the strong model performs even worse than the weak supervisor—and proposes RAVEN, a framework that dynamically learns optimal combination weights over multiple weak models to achieve robust weak-to-strong generalization, surpassing baselines by over 30% on OOD tasks.
What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains: This paper theoretically proves that a two-layer single-head Transformer suffices to represent the conditional $k$-gram model (i.e., $k$-th order induction head) for any $k$-th order Markov process, establishing the tightest known characterization of the relationship between Transformer depth and Markov order. The key insight is leveraging ReLU and LayerNorm nonlinearities in the MLP to compensate for the reduced number of layers.
Wider or Deeper: Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search: AB-MCTS proposes an adaptive-branching Monte Carlo Tree Search framework that dynamically decides at each node whether to go "wider" (generate new candidate answers) or "deeper" (refine existing answers using feedback), balancing exploration and exploitation via Bayesian posterior updates, and outperforms repeated sampling and standard MCTS on programming and engineering tasks.
Writing in Symbiosis: Mapping Human Creative Agency in the AI Era: Through longitudinal corpus analysis of 50,000+ documents, this paper proposes the "Dual-Track Evolution" hypothesis — that in the LLM era, human writing exhibits thematic convergence alongside structural stylistic differentiation — and identifies three authorial adaptation archetypes: Adopters, Resistors, and Pragmatists.
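The inverse-variance weighting mentioned in the RankRefine entry above is a standard way to fuse two independent noisy estimates of the same quantity, and can be sketched as follows. The function name `inverse_variance_fuse` and the example numbers are hypothetical; how RankRefine actually derives the ranking-based estimate and its variance is not shown here.

```python
def inverse_variance_fuse(mean_a, var_a, mean_b, var_b):
    """Fuse two independent estimates by weighting each with 1 / variance.

    Returns the fused mean and the variance of the fused estimate.
    Generic textbook formula, not the paper's full pipeline.
    """
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    w_a = 1.0 / var_a
    w_b = 1.0 / var_b
    fused_mean = (w_a * mean_a + w_b * mean_b) / (w_a + w_b)
    fused_var = 1.0 / (w_a + w_b)
    return fused_mean, fused_var


if __name__ == "__main__":
    # Hypothetical values: a base regressor prediction and a ranking-derived estimate.
    base_pred, base_var = 2.0, 1.0
    rank_pred, rank_var = 4.0, 3.0
    print(inverse_variance_fuse(base_pred, base_var, rank_pred, rank_var))
```

The fused variance is never larger than the smaller of the two input variances, which is why adding even a noisy second estimate cannot hurt under the independence assumption.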

🚗 Autonomous Driving

3EED: Ground Everything Everywhere in 3D: This paper introduces 3EED — the first large-scale multi-platform (vehicle, drone, quadruped robot), multimodal (LiDAR + RGB) outdoor 3D visual grounding benchmark, containing over 128K objects and 22K language descriptions, making it 10× larger than existing outdoor datasets. A baseline method incorporating cross-platform alignment, multi-scale sampling, and scale-adaptive fusion is also proposed, revealing substantial performance gaps in cross-platform 3D grounding.
Aha: Predicting What Matters Next — Online Highlight Detection Without Looking Ahead: Aha proposes the first autoregressive framework for Online Highlight Detection (OHD), featuring a decoupled multi-objective prediction head (relevance / informativeness / uncertainty) and a novel Dynamic SinkCache memory mechanism. Under strict causal constraints with no access to future frames, Aha surpasses prior offline methods on TVSum and Mr.Hisum benchmarks by +5.9% and +8.3% mAP, respectively.
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning: AutoVLA integrates physical action tokens directly into a pretrained VLM (Qwen2.5-VL-3B), equips the model with fast/slow dual-thinking modes via SFT, and applies GRPO reinforcement fine-tuning to enable adaptive reasoning switching and optimize planning performance. The approach achieves competitive end-to-end driving performance across four major autonomous driving benchmarks: nuPlan, Waymo, nuScenes, and CARLA.
Availability-aware Sensor Fusion via Unified Canonical Space: This paper proposes ASF (Availability-aware Sensor Fusion), which maps Camera/LiDAR/4D Radar features into a shared space via Unified Canonical Projection (UCP), applies cross-sensor along-patch cross-attention (CASAP, complexity $O(N_qN_s)$ vs. $O(N_qN_sN_p)$) to automatically adapt to available sensors, and employs a Sensor Combination Loss (SCL) covering all 7 sensor subsets. ASF achieves AP_3D of 73.6% on K-Radar (surpassing SOTA by 20.1%), with only a 1.7% performance drop under sensor failure.
BayesG: Bayesian Ego-Graph Inference for Networked Multi-Agent Reinforcement Learning: BayesG enables each agent in networked MARL to learn the dynamic structure of its local communication graph via Bayesian variational inference — sampling edge masks with Gumbel-Softmax and jointly optimizing policy and graph structure under an ELBO objective — achieving 50%+ reward improvement over the best baseline in a 167-agent New York traffic scenario.
Causality Meets Locality: Provably Generalizable and Scalable Policy Learning for Networked Systems: This paper proposes the GSAC framework, which integrates causal representation learning with meta Actor-Critic. By learning sparse causal masks from networked MARL to construct Approximate Compact Representations (ACR), GSAC achieves scalability; by conditioning policies on domain factors, it achieves cross-domain generalization. Finite-sample guarantees are provided for causal recovery, convergence, and adaptation gap.
ChronoGraph: A Real-World Graph-Based Multivariate Time Series Dataset: This paper presents ChronoGraph — the first real-world microservice dataset that simultaneously provides multivariate time series, explicit service dependency graphs, and event-level anomaly labels (6 months / ~700 services / 5-dimensional metrics / 8005 timesteps). Benchmark results reveal substantial room for improvement in long-horizon forecasting and topology-aware modeling among existing methods.
Continuous Simplicial Neural Networks: This paper proposes COSIMO, the first continuous simplicial neural network based on partial differential equations (PDEs), which realizes continuous information flow by defining heat diffusion dynamics on the Hodge Laplacian. COSIMO demonstrates superior stability and over-smoothing control compared to discrete SNNs.
CuMoLoS-MAE: A Masked Autoencoder for Remote Sensing Data Reconstruction: This paper proposes CuMoLoS-MAE, a Masked Autoencoder combining a curriculum masking strategy with Monte Carlo stochastic ensemble inference for high-fidelity reconstruction and pixel-wise uncertainty quantification of remote sensing atmospheric profile data.
CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation: This work introduces the first sketch-to-3D outdoor semantic scene generation task along with a benchmark dataset, SketchSem3D, and proposes CymbaDiff (Cylinder Mamba Diffusion), a denoising network that achieves structured spatial modeling via dual-path Mamba blocks combining cylindrical and Cartesian scanning. CymbaDiff reduces FID by 75% over 3D Latent Diffusion and 71% over 3D DiT.
DBLoss: Decomposition-based Loss Function for Time Series Forecasting: This paper proposes DBLoss, a general-purpose loss function based on exponential moving average (EMA) decomposition. During loss computation, both predictions and ground-truth values are decomposed into seasonal and trend components within the forecasting horizon, and losses are computed separately for each component. DBLoss serves as a plug-and-play replacement for MSE that can be applied to any deep learning forecasting model, with consistent improvements validated across 8 benchmark datasets × 8 SOTA models.
DINO-Foresight: Looking into the Future with DINO: This paper proposes DINO-Foresight, which forecasts future-frame feature evolution within the semantic feature space of a Vision Foundation Model (VFM). A self-supervised Masked Feature Transformer predicts PCA-compressed representations of multi-layer DINOv2 features. Paired with plug-and-play task-specific heads, a single model simultaneously handles semantic segmentation, instance segmentation, depth estimation, and surface normal prediction, substantially outperforming the VISTA world model while achieving 100× faster inference.
DriveDPO: Policy Learning via Safety DPO For End-to-End Autonomous Driving: DriveDPO is a two-stage framework that first fuses human-imitation similarity and rule-based safety scores into a single supervised distribution via unified policy distillation, then applies Safety DPO to construct trajectory preference pairs of the form "human-like but unsafe vs. human-like and safe" for policy fine-tuning — achieving a new state-of-the-art PDMS of 90.0 on NAVSIM.
Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation: This paper proposes Feature Mixing — an extremely simple multimodal outlier synthesis method that generates OOD samples by randomly swapping $N$ dimensions across features from two modalities for training regularization. It provides theoretical guarantees that synthesized outliers reside in low-likelihood regions of the ID distribution with bounded deviation, achieves state-of-the-art performance across 8 datasets and 4 modality combinations, and runs 10×–370× faster than NP-Mix.
Flow Matching-Based Autonomous Driving Planning with Advanced Interactive Behavior Modeling: This paper proposes Flow Planner—a system combining three synergistic innovations: fine-grained trajectory tokenization, an interaction-enhanced spatiotemporal fusion architecture, and flow matching with classifier-free guidance. It is the first purely learning-based method to surpass 90 points on nuPlan Val14 (90.43), and outperforms Diffusion Planner by 8.92 points on the interaction-intensive interPlan benchmark.
Future-Aware End-to-End Driving: Bidirectional Modeling of Trajectory Planning and Scene Evolution: This paper proposes SeerDrive, which models scene evolution and trajectory planning bidirectionally (future-aware planning plus iterative interaction between the two) and achieves state-of-the-art results on NAVSIM and nuScenes.
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving: FSDrive enables VLAs to "think visually" — first acting as a world model to generate a unified visual CoT frame that integrates future lane lines, 3D detection boxes, and scene predictions, then acting as an inverse dynamics model to perform trajectory planning based on current observations and the visual CoT. This approach activates the visual generation capability of MLLMs using only a minimal amount of data (~0.3%).
GSAlign: Geometric and Semantic Alignment Network for Aerial-Ground Person Re-Identification: This paper proposes GSAlign, a framework that addresses geometric distortion and semantic misalignment in aerial-ground person re-identification (AG-ReID) via a Learnable Thin Plate Spline (LTPS) module and a Dynamic Alignment Module (DAM), achieving +18.8% mAP and +16.8% Rank-1 improvements on the CARGO dataset under the aerial-ground protocol.
HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning: This paper proposes HoloLLM, the first framework to integrate rare sensing modalities — including LiDAR, infrared, mmWave radar, and WiFi — into a multimodal large language model (MLLM). Through a Universal Modality-Injection Projector (UMIP), HoloLLM achieves efficient alignment between sensing modalities and text under data-scarce conditions, improving human action QA and captioning by approximately 30% over existing MLLMs.
How Different from the Past? Spatio-Temporal Time Series Forecasting with Self-Supervised Deviation Learning: This paper proposes ST-SSDL, a framework that captures dynamic deviations between current inputs and historical patterns via self-supervised deviation learning (SSDL). It discretizes the latent space using learnable prototypes and enforces relative distance consistency through a contrastive loss and a deviation loss, achieving state-of-the-art performance on six spatio-temporal benchmarks.
L2RSI: Cross-View LiDAR-Based Place Recognition for Large-Scale Urban Scenes via Remote Sensing Imagery: This paper proposes L2RSI, the first framework for LiDAR-based place recognition in ultra-large-scale urban scenes (100 km²) leveraging high-resolution remote sensing imagery. It aligns LiDAR BEV representations with remote sensing semantic spaces via semantic contrastive learning, and introduces Spatio-Temporal Particle Estimation (STPE) to aggregate spatio-temporal information from consecutive queries, achieving 83.27% Top-1 accuracy within a 100 km² retrieval range.
LabelAny3D: Label Any Object 3D in the Wild: This paper proposes LabelAny3D, an analysis-by-synthesis automatic 3D annotation pipeline that reconstructs complete 3D scenes from monocular images to obtain high-quality 3D bounding box annotations. Based on this pipeline, the authors construct the COCO3D benchmark covering 80 categories of everyday objects, achieving significant improvements in open-vocabulary monocular 3D detection.
Layer-wise Modality Decomposition for Interpretable Multimodal Sensor Fusion: This paper proposes LMD (Layer-Wise Modality Decomposition), a post-hoc, model-agnostic interpretability method that linearizes neural network operations layer by layer to exactly decompose the predictions of multimodal fusion models into per-sensor modality contributions. LMD is the first method to achieve prediction attribution to individual input modalities in autonomous driving perception models, and its effectiveness is validated across camera-radar, camera-LiDAR, and camera-radar-LiDAR fusion settings.
FlowScene: Learning Temporal 3D Semantic Scene Completion via Optical Flow Guidance: This paper proposes FlowScene, which leverages optical flow to guide temporal feature aggregation and employs occlusion masks for voxel refinement. Using only 2 historical frames as input, FlowScene achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks (mIoU 17.70 / 20.81).
Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation: This paper proposes Vireo, the first single-stage framework that unifies open-vocabulary semantic segmentation (OVSS) and domain-generalized semantic segmentation (DGSS). By introducing GeoText Query to fuse depth-geometric features with linguistic cues, Vireo achieves state-of-the-art performance under both extreme environmental conditions and on unseen categories.
Model-Based Policy Adaptation for Closed-Loop End-to-End Autonomous Driving: This paper proposes MPA, a framework that generates counterfactual trajectory data via 3DGS simulation, trains a diffusion policy adapter and a multi-principle Q-value model, and uses them at inference time to guide a pretrained E2E driving model toward improved safety and generalization in closed-loop scenarios.
Neurosymbolic Diffusion Models: This paper proposes Neurosymbolic Diffusion Models (NeSyDM), which integrates discrete masked diffusion models with symbolic programs to overcome the conditional independence assumption in traditional neurosymbolic predictors. NeSyDM models inter-concept dependencies and uncertainty while maintaining scalability, achieving state-of-the-art accuracy and calibration on visual reasoning and autonomous driving benchmarks.
OpenBox: Annotate Any Bounding Boxes in 3D: This paper proposes OpenBox, a two-stage automatic 3D bounding box annotation pipeline that first maps instance-level information from 2D visual foundation models to 3D point clouds via cross-modal instance alignment, then adaptively generates high-quality 3D bounding boxes based on the physical state of each object (static rigid / dynamic rigid / deformable), without requiring any self-training iterations.
Predictive Preference Learning from Human Interventions: PPL leverages a trajectory prediction model to anticipate the agent's future states and "bootstraps" each human intervention signal across the predicted future horizon to construct contrastive preference data. Combined with a dual-loss training strategy of behavior cloning and preference optimization, PPL substantially reduces the number of required human interventions and demonstration data.
Prioritizing Perception-Guided Self-Supervision: A New Paradigm for Causal Modeling in End-to-End Autonomous Driving: This work addresses causal confusion in end-to-end autonomous driving by leveraging perception outputs (lane centerlines, agent trajectories) and self-supervised learning to establish causal relationships, achieving state-of-the-art performance on the Bench2Drive closed-loop benchmark (Driving Score 78.08).
RAW2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving: This paper proposes RAW2Drive, the first model-based reinforcement learning (MBRL) end-to-end autonomous driving framework operating directly from raw sensor inputs to planning. Through a dual-stream world model design — first training a privileged world model, then guiding a raw-sensor world model via an alignment mechanism — RAW2Drive achieves state-of-the-art performance on CARLA v2 and Bench2Drive, substantially outperforming imitation learning (IL) methods.
Regret Lower Bounds for Decentralized Multi-Agent Stochastic Shortest Path Problems: This paper establishes the first $\Omega(\sqrt{K})$ regret lower bound for the Decentralized Multi-Agent Stochastic Shortest Path (Dec-MASSP) problem under linear function approximation. By constructing a family of hard-to-learn instances and employing a symmetry argument to identify the structure of optimal policies, the paper demonstrates that this lower bound matches existing upper bounds in terms of the number of episodes $K$.
SDTagNet: Leveraging Text-Annotated Navigation Maps for Online HD Map Construction: This paper proposes SDTagNet, the first method to encode OpenStreetMap text annotations (road names, lane counts, one-way indicators, etc.) via BERT and to unify all SD map elements (points, polylines, and relations) through a point-level graph Transformer. On long-range HD map construction, SDTagNet achieves +5.9 mAP (+45%) over prior-free baselines and +3.2 mAP (+20%) over existing SD map prior methods.
Self-Supervised Learning of Graph Representations for Network Intrusion Detection: This paper proposes GraphIDS, a self-supervised intrusion detection model that unifies graph representation learning and anomaly detection via a masked autoencoder, achieving a PR-AUC of 99.98% and macro F1 of 99.61% on multiple NetFlow benchmarks, surpassing baselines by 5–25 percentage points.
Semantic Glitch: Agency and Artistry in an Autonomous Pixel Cloud: This paper presents "Pixel Cloud," a low-fidelity autonomous aerial robotic art installation that deliberately forgoes conventional LiDAR/SLAM sensors and relies solely on the semantic understanding of a multimodal large language model (MLLM) for navigation. Through natural language prompting, the robot is endowed with a biologically inspired narrative persona, yielding imprecise yet characterful emergent behaviors.
SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration: This paper presents SimWorld-Robotics (SWR), a large-scale urban simulation platform built on Unreal Engine 5 that supports procedural generation of unlimited photorealistic city environments. Built upon this platform, two new benchmarks are introduced — SimWorld-MMNav for multimodal navigation and SimWorld-MRS for multi-robot search — which collectively reveal critical capability gaps in current VLMs on outdoor urban tasks.
Spatio-Temporal Graphs Beyond Grids: Benchmark for Maritime Anomaly Detection: This paper proposes the first graph anomaly detection benchmark for non-grid spatio-temporal systems in the maritime domain. It extends the OMTAD dataset to support node/edge/graph-level anomaly detection, and plans to employ LLM agents for trajectory synthesis and anomaly injection.
SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation and Understanding: SPIRAL proposes a semantic-aware range-view LiDAR diffusion model that jointly generates depth maps, reflectance images, and semantic segmentation maps. By introducing progressive semantic prediction and a closed-loop inference mechanism to enhance cross-modal consistency, the model achieves state-of-the-art performance with a minimal parameter count of 61M.
SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving: SQS presents the first query-based 3D Gaussian splatting pre-training framework for sparse perception models (SPMs). Through self-supervised reconstruction of RGB images and depth maps, the method learns fine-grained 3D representations, and a query interaction module fuses the pre-trained Gaussian queries with task-specific queries. SQS achieves significant improvements over existing pre-training methods on occupancy prediction (+1.3 mIoU) and 3D object detection (+1.0 NDS).
StreamForest: Efficient Online Video Understanding with Persistent Event Memory: This paper proposes StreamForest, an architecture that adaptively organizes streaming video frames into multiple event-level tree structures via a "Persistent Event Memory Forest," combined with a "Fine-grained Spatiotemporal Window" to capture short-term visual cues. The method achieves 77.3% accuracy on StreamingBench and retains 96.8% of performance under extreme compression (only 1024 visual tokens).
Towards Foundational LiDAR World Models with Efficient Latent Flow Matching: This paper proposes the first transferable LiDAR world model, built on three results. A Swin Transformer VAE achieves a 192× compression ratio with state-of-the-art reconstruction accuracy. Conditional Flow Matching (CFM) replaces diffusion models and reaches state-of-the-art semantic occupancy prediction using only 4.38% of prior work's FLOPs. Across three domain transfer tasks, the model trained with only 5% labeled data surpasses OccWorld trained on full annotations.
Towards Physics-Informed Spatial Intelligence with Human Priors: An Autonomous Driving Perspective: This paper proposes the Spatial Intelligence Grid (SIG)—a structured representation inspired by the perspective grids used by Renaissance painters—that explicitly encodes object layout, directional relationships, and distance relationships in driving scenes as a grid structure. The authors further construct the SIGBench benchmark, demonstrating that SIG enables more stable and comprehensive improvements in the spatial reasoning capabilities of MLLMs under few-shot in-context learning compared to conventional VQA-based approaches.
Towards Predicting Any Human Trajectory in Context: This paper proposes TrajICL, an in-context learning (ICL) framework for pedestrian trajectory prediction that adapts to new scenes without fine-tuning. It combines spatiotemporal similarity-based example selection with prediction-guided example selection, and surpasses even fine-tuned baselines.

Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting: A multi-scale bilateral grid pyramid is proposed to unify global appearance codes and pixel-level bilateral grids. A 3-level hierarchy (coarse→medium→fine) captures global/regional/pixel-level photometric variation respectively. By employing a luminance-guided slice-and-blend pipeline and adaptive regularization, the method addresses photometric inconsistency in driving scene 3DGS, achieving a 28.2% improvement in Chamfer Distance over OmniRe on Waymo.
UniMotion: A Unified Motion Framework for Simulation, Prediction and Planning: UniMotion proposes a unified motion framework built on a decoder-only Transformer, supporting motion simulation, trajectory prediction, and ego-vehicle planning simultaneously through task-aware interaction patterns and training strategies. Joint training facilitates cross-task knowledge sharing, and after task-specific fine-tuning, the model achieves state-of-the-art performance across multiple tasks on the Waymo dataset.
URB -- Urban Routing Benchmark for RL-Equipped Connected Autonomous Vehicles: This paper presents URB — the first large-scale MARL benchmark environment for urban mixed-traffic (human + CAV) routing, integrating 29 real-world traffic networks, the microscopic traffic simulator SUMO, and empirical travel demand patterns. Experiments reveal that current state-of-the-art MARL algorithms rarely outperform human drivers, highlighting the urgent need for algorithmic advances in this domain.
UrbanIng-V2X: A Large-Scale Multi-Vehicle Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception: UrbanIng-V2X is the first real-world cooperative perception dataset spanning multiple vehicles, multiple infrastructure sensors, and multiple urban intersections. It provides 712K annotated instances across 13 categories in 34 scenes, and through a cross-intersection evaluation strategy (SIS) quantitatively reveals a substantial generalization gap of 14 mAP exhibited by existing cooperative perception methods on unseen intersections.
V2X-Radar: A Multi-Modal Dataset with 4D Radar for Cooperative Perception: This paper presents V2X-Radar, the first large-scale real-world multi-modal vehicle-to-everything (V2X) cooperative perception dataset incorporating 4D radar, LiDAR, and multi-view camera data. The dataset covers diverse weather and lighting conditions, providing 20K LiDAR frames, 40K camera images, 20K 4D radar scans, and 350K annotated bounding boxes, along with comprehensive benchmarks across three sub-datasets.
X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability: This paper presents X-Scene, a unified large-scale driving scene generation framework that supports multi-granularity control ranging from high-level text prompts to low-level BEV layouts. By jointly generating 3D semantic occupancy, multi-view images, and videos, and leveraging consistency-aware extrapolation for large-scale scene expansion, X-Scene comprehensively outperforms existing methods in generation quality (FID 11.29) and downstream tasks.
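
The EMA-based decomposition loss described in the DBLoss entry above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the smoothing factor `alpha`, the `trend_weight` parameter, the use of MSE on both components, and every function name here are hypothetical choices made for the example.

```python
def ema(series, alpha=0.3):
    """Exponential moving average of a 1-D sequence, used as the trend estimate."""
    trend = [series[0]]
    for x in series[1:]:
        trend.append(alpha * x + (1 - alpha) * trend[-1])
    return trend


def decompose(series, alpha=0.3):
    """Split a sequence into (seasonal, trend), where seasonal = series - trend."""
    trend = ema(series, alpha)
    seasonal = [x - t for x, t in zip(series, trend)]
    return seasonal, trend


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def decomposition_loss(pred, target, alpha=0.3, trend_weight=1.0):
    """Sum of per-component errors over the forecasting horizon.

    Assumption: both components use MSE and are combined with a fixed weight;
    the paper may use different per-component losses or weighting.
    """
    pred_seasonal, pred_trend = decompose(pred, alpha)
    target_seasonal, target_trend = decompose(target, alpha)
    return mse(pred_seasonal, target_seasonal) + trend_weight * mse(pred_trend, target_trend)


if __name__ == "__main__":
    target = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0]
    pred = [1.1, 1.8, 3.2, 2.1, 0.9, 2.2, 2.7, 2.0]
    print(decomposition_loss(pred, target))
```

Computing the error on each component separately lets a trend mismatch and a seasonal mismatch be weighted independently, which a single MSE over the raw series cannot do.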

🦾 LLM Agent¶

A-MEM: Agentic Memory for LLM Agents: This paper proposes A-Mem, a Zettelkasten-inspired agentic memory system for LLM agents. Each memory entry automatically generates a structured note (keywords/tags/contextual description), dynamically establishes inter-memory links, and triggers evolutionary updates to existing memories upon the insertion of new ones. A-Mem substantially outperforms baselines such as MemGPT on the LoCoMo long-conversation QA benchmark.
Adaptive Coopetition: Leveraging Coarse Verifier Signals for Resilient Multi-Agent LLM Reasoning: This paper proposes the Adaptive Coopetition (AdCo) framework, which employs a UCB multi-armed bandit strategy with coarse-grained verifier signals to enable multiple LLM agents to adaptively switch between cooperative and competitive modes during inference, achieving a 20% relative improvement on mathematical reasoning benchmarks.
AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents: This paper proposes AgentAuditor — a training-free, memory-augmented reasoning framework that enables LLMs to adaptively extract structured semantic features (scenario, risk, behavior) to construct an experiential memory bank, then employs multi-stage context-aware retrieval-augmented generation to guide LLM evaluators in assessing agent behavior for safety and security threats. The work also introduces ASSEBench, the first benchmark jointly covering safety and security evaluation (2,293 records, 15 risk types, 29 scenarios), achieving human expert-level evaluation accuracy across multiple benchmarks.
AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness: AgentChangeBench is the first benchmark that systematically evaluates the adaptability of LLM agents when user goals shift mid-conversation: 315 base tasks × 9 variants = 2,835 sequences, spanning 3 enterprise domains (banking/retail/airline) and 5 user personas. It introduces 4 complementary metrics including GSRT (Goal-Shift Recovery Time), revealing efficiency and robustness gaps masked by high pass@k—e.g., GPT-4o achieves 92.2% airline recovery rate yet 89.1% retail redundancy rate.
AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents: This paper proposes AgentDAM, the first benchmark for end-to-end evaluation of data minimization compliance by AI agents in real web environments. It comprises 246 tasks spanning Reddit, GitLab, and Shopping platforms, and finds that leading models such as GPT-4o exhibit privacy leakage rates of 36–46% without mitigation, while a CoT-based privacy prompt reduces leakage rates to 6–8%.
Agentic NL2SQL to Reduce Computational Costs: This paper proposes Datalake Agent, an agentic NL2SQL system built on an interactive reasoning loop. Through a hierarchical information retrieval strategy (GetDBDescription → GetTables → GetColumns → DBQueryFinalSQL), the system enables LLMs to request database schema information on demand rather than receiving it all at once. In a setting with 319 tables, the approach reduces token usage by 87% and cost by 8×, while maintaining superior performance on complex queries.
Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient LLM Agents: This paper proposes Agentic Plan Caching (APC), which extracts structured plan templates from agent execution logs and reuses them via keyword-matching cache hits with a small model for adaptation. APC reduces cost by 50.31% and latency by 27.28% on average while retaining 96.61% of accuracy-optimal performance.
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents: This paper proposes the AgentMisalignment benchmark suite, comprising 9 realistic scenario evaluation tasks that measure the propensity of LLM agents to spontaneously deviate from deployer intent under non-malicious instructions (rather than measuring capability). The study finds that stronger models tend to exhibit higher misalignment, and that persona prompts sometimes exert greater influence on misaligned behavior than model choice itself.
AgentTTS: Large Language Model Agent for Test-time Compute-optimal Scaling Strategy in Complex Tasks: This paper investigates the problem of compute-optimal test-time scaling in multi-stage complex tasks. Through large-scale pilot experiments, three generalizable scaling insights for LLMs on multi-stage tasks are identified. The authors propose AgentTTS—an LLM agent-based framework that autonomously searches for compute-optimal model selection and budget allocation strategies via iterative feedback-driven search.
Are Large Language Models Sensitive to the Motives Behind Communication?: Three progressive experiments systematically evaluate whether LLMs possess "motivational vigilance"—the ability to recognize the intentions and incentives of information sources and adjust trust accordingly. In controlled experiments, frontier non-reasoning LLMs perform close to the rational model (Pearson's $r > 0.9$) and resemble humans more than the rational model does; however, vigilance drops sharply in real-world YouTube sponsored content ($r < 0.2$), and simple prompt steering partially restores it (raising $r$ to 0.31).
Attractive Metadata Attack: Inducing LLM Agents to Invoke Malicious Tools: AMA (Attractive Metadata Attack) demonstrates that by carefully crafting malicious tool metadata (name, description, parameter schema) alone — without prompt injection or internal model access — an attacker can induce LLM agents to invoke malicious tools and leak private data at a success rate of 81–95%, while barely affecting original task completion (98%+), with existing defenses (auditors, prompt rewriting) proving largely ineffective.
Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection: This paper formalizes the agent component selection problem as an online knapsack problem and proposes the Composer Agent framework, which evaluates true component capabilities via sandbox testing (rather than static semantic retrieval) and dynamically selects optimal component combinations under budget constraints using the ZCL online algorithm. The approach achieves up to a 31.6% improvement in single-agent tool selection success rate, and boosts multi-agent sub-agent selection success rate from 37% to 87%.
Automated Multi-Agent Workflows for RTL Design: VeriMaAS is a multi-agent framework that integrates HDL formal verification feedback (Yosys + OpenSTA) into the automated workflow generation process, adaptively selecting reasoning operators (I/O → CoT → ReAct → SelfRefine → Debate) for RTL code generation tasks. With only a few hundred training samples, it achieves 5–7% higher pass@k performance than fine-tuning baselines.
Benchmarking Agentic Systems in Automated Scientific Information Extraction with ChemX: This paper constructs ChemX — a suite of 10 multimodal chemical data extraction benchmark datasets manually annotated and validated by domain experts, spanning nanomaterials and small molecules. It systematically evaluates state-of-the-art agentic systems including ChatGPT Agent, SLM-Matrix, FutureHouse, and nanoMINER, as well as frontier LLMs such as GPT-5 and GPT-5 Thinking. The proposed single-agent method achieves F1=0.61 on the nanozyme dataset through structured document preprocessing (marker-pdf → Markdown → LLM extraction), surpassing all general-purpose multi-agent systems, while revealing systemic challenges in chemical information extraction such as SMILES parsing failures and terminology ambiguity.
BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent: This paper proposes the Blink-Think-Link (BTL) brain-inspired framework, which decomposes GUI interaction into three biologically plausible stages: Blink (rapid attentional localization), Think (cognitive reasoning and decision-making), and Link (executable command generation). Combined with an automated Blink data annotation pipeline and the first rule-based composite process-and-outcome reward mechanism, BTL Reward, the resulting BTL-UI model achieves competitive performance on both static GUI understanding and dynamic interaction benchmarks.
CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension: Inspired by Piaget's constructivist theory, this paper proposes CAM — an agentic memory system characterized by three properties: structuredness (hierarchical schema), flexibility (assimilation via overlapping clustering), and dynamism (incremental adaptation). CAM comprehensively outperforms baselines such as RAPTOR and GraphRAG across six long-document reading comprehension benchmarks.
ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions: This paper proposes ContextAgent, the first LLM agent framework that leverages multimodal sensory perception from wearable devices (video + audio + notifications) to understand user intent and proactively deliver tool-augmented services. It also introduces ContextAgentBench, a benchmark of 1,000 samples, achieving improvements of 8.5% in proactive prediction accuracy and 6.0% in tool invocation accuracy.
CORE: Full-Path Evaluation of LLM Agents Beyond Final State: This paper proposes CORE, a framework that encodes legitimate tool-calling paths for agent tasks using deterministic finite automata (DFA) and introduces five complementary metrics (path correctness, order correctness, prefix criticality, harm rate, and efficiency) to evaluate agent behavior along the full execution path rather than the final state alone, revealing safety and efficiency differences invisible to conventional final-state evaluation.
Crucible: Quantifying the Potential of Control Algorithms through LLM Agents: This paper is the first to formalize the concept of Tuning Potential, using LLM agents to simulate multi-level developers performing dual-layer (parameter + logic) optimization of control algorithms. On CartPole, the Bang-bang controller improves from 34 to 500, reaching DQN-level performance; on ABR tasks, Crucible achieves up to 44.1% improvement over Bayesian optimization.
Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models?: This work establishes, both theoretically and empirically, that the performance gains attributed to Multi-Agent Debate (MAD) stem primarily from majority voting (ensembling) rather than the debate process itself. The debate dynamics are shown to constitute a martingale—meaning debate does not systematically improve correctness in expectation—and this theoretical insight motivates a principled improvement to MAD by biasing updates toward correct signals.
Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding: This paper proposes DVD (Deep Video Discovery), an agent that frames long-form video understanding as a multi-step information search problem. It first constructs a multi-granular structured database from a long video (global summary + clip-level caption embeddings + frame-level pixels), then provides three search tools (Global Browse / Clip Search / Frame Inspect). A reasoning LLM autonomously orchestrates the search trajectory via an observe-reason-act loop. DVD achieves 74.2% on LVBench (surpassing the previous SOTA MR.Video by 13.4 pp), and 76.0% with subtitles.
DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments: This paper presents DefenderBench, an open-source modular toolkit for systematically evaluating LLM agents across three categories of cybersecurity tasks—offensive, defensive, and knowledge understanding—covering five scenarios: network intrusion simulation, malicious content detection, code vulnerability detection/repair, and CTI knowledge QA. Benchmark results show that Claude-3.7-sonnet achieves the best overall performance (81.65 points).
Distilling LLM Agent into Small Models with Retrieval and Code Tools: This paper proposes an Agent Distillation framework that distills the complete reason-act-observe interactive behaviors of LLM agents (rather than static CoT) into small models ranging from 0.5B to 7B parameters. Combined with a first-thought prefix to improve teacher trajectory quality and self-consistent action generation to enhance inference robustness, the framework enables small models to achieve performance comparable to CoT-distilled models 2–4× their size.
DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents: DRIFT is a system-level agent security framework featuring three layers of defense: Secure Planner (pre-planned function trajectories and parameter checklists), Dynamic Validator (dynamic policy updates based on Read/Write/Execute permissions), and Injection Isolator (detection and masking of injected instructions from the memory stream). On AgentDojo, DRIFT reduces ASR from 30.7% to 1.3% while achieving 20.1% higher utility than CaMeL.
Enhancing Demand-Oriented Regionalization with Agentic AI and Local Heterogeneous Data for Adaptation Planning: This paper proposes a planning support system based on Agentic AI, in which an LLM agent guides non-technical users through data-driven demand-oriented regionalization. The core algorithm is RepSC-SOM (spatially constrained self-organizing map with representative initialization), supporting iterative human-AI collaborative refinement of regional delineations for disaster risk management and climate adaptation planning.
EU-Agent-Bench: Measuring Illegal Behavior of LLM Agents Under EU Law: This paper introduces EU-Agent-Bench, the first verifiable agent benchmark grounded in the EU legal framework. Using 600 benign user requests, it evaluates whether LLM agents' tool calls violate EU regulations. Results show that even the best-performing model (Gemini 2.5 Flash) achieves a legality rate of only ~55%, revealing a substantial gap between current alignment techniques and legal reliability.
Generative AI Agents for Controllable and Protected Content Creation: This paper proposes a multi-agent generative framework that addresses controllability and copyright protection in a unified manner through the collaboration of five specialized agents — Director/Planner, Generator, Reviewer, Integration, and Protection — augmented with human-in-the-loop feedback.
Ground-Compose-Reinforce: Grounding Language in Agentic Behaviours using Limited Data: This paper proposes Ground-Compose-Reinforce (GCR), an end-to-end neuro-symbolic framework that learns the grounding semantics of atomic propositions from a small number of annotated trajectories (only 350), composes them into complex task specifications via Reward Machines, and trains an RL agent using self-generated dense rewards — eliciting out-of-distribution complex behaviours without any hand-crafted reward functions.
Group-in-Group Policy Optimization for LLM Agent Training: GiGPO introduces step-level grouping nested within the episode-level grouping of GRPO by leveraging recurring environment states across trajectories as anchor states, enabling fine-grained credit assignment without additional rollouts or a critic model. It outperforms GRPO by >12% on ALFWorld and >9% on WebShop.
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention: This paper proposes Hogwild! Inference—a parallel LLM inference protocol that requires no predefined collaboration framework. Multiple LLM instances synchronize in real time through a shared concurrent KV cache, leveraging RoPE positional encoding to avoid recomputation, achieving higher accuracy with fewer serial steps on mathematical reasoning and programming tasks.
It's LIT! Reliability-Optimized LLMs with Inspectable Tools: By defining reliability/inspectability cost functions for each external tool, LIT guides LLMs to select the lowest-cost (most transparent and auditable) tool-calling path among multiple candidates, improving interpretability while maintaining or enhancing task accuracy in 61 out of 65 test scenarios.
LC-Opt: Benchmarking Reinforcement Learning and Agentic AI for End-to-End Liquid Cooling Optimization in Data Centers: This paper presents LC-Opt, a liquid cooling benchmark environment built upon a high-fidelity digital twin of the cooling system of the ORNL Frontier supercomputer. It supports end-to-end liquid cooling optimization via RL control policies, encompassing centralized/decentralized multi-agent RL, policy distillation into interpretable decision trees, and an LLM-driven agentic mesh architecture.
Lessons Learned: A Multi-Agent Framework for Code LLMs to Learn and Improve: This paper proposes the LessonL framework, enabling multiple small LLM agents to reflect on both successful and failed cases through mutually shared "lessons," collaboratively optimizing code performance. A combination of three 7B–14B models achieves code optimization results on par with GPT-4o and approaching o3.
LLM Agent Communication Protocol (LACP) Requires Urgent Standardization: A Telecom-Inspired Protocol is Necessary: This position paper argues that the fragmented ecosystem of current LLM Agent communication mirrors the "protocol wars" of the early networking era. It proposes LACP, a three-layer protocol (Semantic, Transactional, and Transport layers) inspired by telecom standardization, and contends that security-by-design, transactional integrity, and semantic interoperability are critical for multi-agent systems.
LLM Agents for Knowledge Discovery in Atomic Layer Processing: By having an LLM agent control a simulated chemical reactor (a black-box function), this work demonstrates that agents can explore, discover, and summarize the rules of an unknown chemical system through trial and error without any prior knowledge, revealing both the capabilities and limitations of agents for open-ended scientific discovery.
MAT-Agent: Adaptive Multi-Agent Training Optimization: This paper proposes MAT-Agent, a multi-agent framework consisting of four autonomous agents responsible for data augmentation, optimizer selection, learning rate scheduling, and loss function selection, respectively. The framework dynamically adjusts training configurations during the training process, employing DQN to learn policies as a replacement for conventional static hyperparameter configurations, and achieves state-of-the-art performance on multi-label image classification tasks.
MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?: This paper proposes MLRC-Bench, a dynamic benchmark grounded in ML conference competition tasks, designed to objectively evaluate the ability of LLM agents to propose and implement novel research methods. The study finds that even the strongest agent (gemini-exp-1206) closes only 9.3% of the gap between the baseline and top human solutions, and that LLM subjective scores for "novelty" exhibit virtually no correlation with actual performance.
Orchestration Framework for Financial Agents: From Algorithmic Trading to Agentic Trading: This paper proposes FinAgent, an orchestration framework that maps each component of a traditional algorithmic trading system to a dedicated AI agent (Planner, Orchestrator, Alpha/Risk/Portfolio/Backtest/Execution/Audit/Memory agents), employs MCP for control communication and A2A for inter-agent communication, and validates the framework's feasibility on stock and BTC trading tasks.
PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer: This paper proposes PANDA, an agentic AI engineer framework built upon MLLMs, which achieves training-free and human-intervention-free generalist video anomaly detection through four core capabilities: adaptive scene-aware strategy planning, goal-driven heuristic reasoning, tool-augmented self-reflection, and chain-of-memory.
R&D-Agent-Quant: A Multi-Agent Framework for Data-Centric Factors and Model Joint Optimization: This paper proposes R&D-Agent(Q), a data-driven multi-agent framework that automates the joint optimization of factor mining and model innovation for quantitative strategies through five collaborative modules (Specification, Synthesis, Implementation, Validation, and Analysis), achieving approximately 2× the annualized return of traditional factor libraries in real stock markets at a cost of under $10.
ShapeCraft: LLM Agents for Structured, Textured and Interactive 3D Modeling: This paper proposes ShapeCraft, a multi-agent framework built on a Graph-based Procedural Shape (GPS) representation. Three LLM agents — Parser, Coder, and Evaluator — collaborate to decompose natural language descriptions into structured sub-task graphs, iteratively generating editable and animatable textured 3D assets.
SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications: SuffixDecoding caches long token sequences in suffix trees and adapts the speculation length, achieving a 5.3× speedup on the highly predictable, repetitive inference tasks found in agent scenarios.
T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning: This paper introduces T1, a dataset of 13.5K multi-turn dialogues spanning 9 domains (4 single-domain + 5 cross-domain) and 14 tools, with a focus on inter-tool dependencies and dynamic replanning. A baseline system, T1-Agent (code generation + caching mechanism), is proposed for systematic evaluation. Experiments show that SFT-tuned Llama 8B achieves 87.17% Tool Call F1, surpassing untuned 70B models, yet still trailing closed-source models such as GPT-5 and o3.
TAI3: Testing Agent Integrity in Interpreting User Intent: This paper proposes TAI3, an API-centric stress-testing framework for LLM agent intent integrity. It organizes the natural language input space into a structured test grid via Semantic Partitioning, and leverages Intent-Preserving Mutation and Strategy Memory to efficiently expose intent misinterpretation errors when agents execute user tasks.
The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement: This paper proposes CGI (Critique-Guided Improvement), a dual-role framework that trains a dedicated Critic model to provide structured natural language feedback (discrimination + correction suggestions) to an Actor Agent, and enables the Actor to learn to leverage such feedback through iterative action refinement. CGI achieves an average score of 74.20% across WebShop, ScienceWorld, and TextCraft, surpassing GPT-4o (45.46%) and Iterative SFT (58.21%).
Traj-CoA: Patient Trajectory Modeling via Chain-of-Agents for Lung Cancer Risk Prediction: This paper proposes Traj-CoA, a multi-agent framework that employs a chain-of-agents architecture with an EHRMem long-term memory module to perform temporal reasoning over long, noisy longitudinal EHRs. The framework surpasses ML/DL/BERT/LLM baselines on zero-shot lung cancer risk prediction tasks (5-year EHR data, up to 160k tokens).
TrajAgent: An LLM-Agent Framework for Trajectory Modeling via Large-and-Small Model Collaboration: This paper proposes TrajAgent — an LLM-agent-based framework for trajectory modeling that achieves automated, cross-task, and cross-dataset trajectory modeling through a unified environment (UniEnv), an automated workflow, and a collaborative learning schema between large and small models, outperforming baseline methods by 2.38%–69.91% across multiple tasks.
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents: This paper proposes Web-Shepherd, the first process reward model (PRM) specifically designed for web navigation. By decomposing task objectives into evaluable sub-goals via checklists, 3B/8B models achieve trajectory accuracy far surpassing GPT-4o (85% vs. 10%) at only 1/10 of the cost, making reinforcement learning and inference-time search for web agents practically feasible.
What AI Speaks for Your Community: Polling AI Agents for Public Opinion on Data Center Projects: This paper proposes an LLM-based AI agent polling framework that synthesizes demographically representative virtual resident agents to conduct large-scale, low-cost public opinion surveys on data center projects. Cross-model and cross-region experiments demonstrate high thematic alignment between agent opinions and real-world polls.
Zero-Shot Large Language Model Agents for Fully Automated Radiotherapy Treatment Planning: This paper proposes a zero-shot LLM Agent-based workflow for automated radiotherapy treatment planning, in which the LLM directly interacts with a commercial treatment planning system (Eclipse TPS). By iteratively extracting dose-volume histogram (DVH) metrics and objective function losses and reasoning about constraint adjustment strategies, the approach achieves dose distribution quality comparable to or better than clinical manual planning on 20 head-and-neck cancer IMRT cases.
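
To make the DFA-based path checking in the CORE entry above more concrete, here is a minimal sketch. It is not the paper's implementation and does not reproduce its five metrics. The toy task (search, book, pay), the state names, and the `ToolPathDFA` class with its `valid_prefix_length` helper are all hypothetical, chosen only to show how a full tool-calling path can be checked rather than just the final state.

```python
from typing import Dict, List, Set, Tuple


class ToolPathDFA:
    """Deterministic finite automaton over tool names (toy illustration)."""

    def __init__(self, start: str, accepting: Set[str],
                 transitions: Dict[Tuple[str, str], str]) -> None:
        self.start = start
        self.accepting = accepting
        self.transitions = transitions

    def valid_prefix_length(self, calls: List[str]) -> int:
        """Number of leading tool calls that follow a legal transition."""
        state = self.start
        for i, tool in enumerate(calls):
            nxt = self.transitions.get((state, tool))
            if nxt is None:
                return i  # the i-th call leaves the set of legal paths
            state = nxt
        return len(calls)

    def accepts(self, calls: List[str]) -> bool:
        """True only if the whole path is legal and ends in an accepting state."""
        state = self.start
        for tool in calls:
            state = self.transitions.get((state, tool))
            if state is None:
                return False
        return state in self.accepting


# Hypothetical task: search (possibly repeated), then book, then pay.
booking_dfa = ToolPathDFA(
    start="start",
    accepting={"done"},
    transitions={
        ("start", "search"): "searched",
        ("searched", "search"): "searched",
        ("searched", "book"): "booked",
        ("booked", "pay"): "done",
    },
)

legal_path = ["search", "search", "book", "pay"]
skipped_search = ["book", "pay"]
wrong_order = ["search", "pay", "book"]

print(booking_dfa.accepts(legal_path))                 # True
print(booking_dfa.accepts(skipped_search))             # False
print(booking_dfa.valid_prefix_length(wrong_order))    # 1
```

A final-state check alone could not separate these paths if they happened to end in the same environment state. The automaton rejects the second and third because of how they got there.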

📚 Pretraining

A Practical Guide for Incorporating Symmetry in Diffusion Policy: This paper presents a practical guide for incorporating symmetry into diffusion policies. Through three simple and composable methods — invariant representations (relative trajectory actions + eye-in-hand perception), equivariant visual encoders, and Frame Averaging — the proposed approach achieves performance on par with or exceeding fully equivariant diffusion policies across 12 MimicGen tasks, while substantially reducing implementation complexity.
AI Progress Should Be Measured by Capability-Per-Resource, Not Scale Alone: A Framework for Gradient-Guided Resource Allocation in LLMs: This position paper challenges "scaling fundamentalism" by proposing Capability-Per-Resource (CPR) as a replacement for raw scale as the primary measure of AI progress. The paper presents a gradient-guided resource allocation framework in which foundation model developers publish "gradient blueprint" metadata, enabling downstream adapters to fine-tune only a high-influence parameter subset while substantially reducing resource consumption and maintaining performance close to full-parameter fine-tuning.
Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks: This paper proposes the Alternating Gradient Flow (AGF) theoretical framework to explain the stepwise "saddle-to-saddle" feature learning dynamics in neural networks. Training is modeled as an alternating process between utility maximization for dormant neurons and cost minimization for active neurons, unifying feature selection analysis across diagonal linear networks, attention models, and modular addition. Predictions from AGF exhibit high agreement with actual gradient flow behavior.
An Empirical Investigation of Neural ODEs and Symbolic Regression for Dynamical Systems: This paper presents a systematic empirical study of the extrapolation capabilities of Neural ODEs (NODEs) and the equation recovery ability of Symbolic Regression (SR) for dynamical systems. It finds that NODEs can extrapolate to new boundary conditions under dynamically similar settings, and proposes a NODE→SR pipeline: a NODE is trained on only 10% of the original data to generate augmented trajectories, from which SR recovers 2/3 of the governing equations exactly and provides good approximations for the remaining 1/3.
Beyond Benign Overfitting in Nadaraya-Watson Interpolators: By tuning a single bandwidth parameter $\beta$ in the Nadaraya-Watson interpolator, this paper precisely characterizes the complete phase transition spectrum from catastrophic overfitting ($\beta < d$) → benign overfitting ($\beta = d$) → tempered overfitting ($\beta > d$), demonstrating that overestimating the intrinsic dimensionality of data is safer than underestimating it.
Born a Transformer – Always a Transformer? On the Effect of Pretraining on Architectural Abilities: Through systematic study of a family of retrieval and copying tasks, this paper reveals that large-scale pretraining introduces a directional bias into Transformers (rightward/forward over leftward/backward), while failing to overcome fundamental architectural limitations on non-unique tasks. Fine-tuning can eliminate the directional bias but cannot surpass the boundaries of architectural expressiveness.
Brain-tuning Improves Generalizability and Efficiency of Brain Alignment in Speech Models: This paper proposes Multi-brain-tuning, a method that jointly fine-tunes pretrained speech models on fMRI data from multiple participants, reducing the data required for brain alignment by 5×, improving alignment by up to 50%, and generalizing to unseen participants and datasets.
Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining: This paper identifies that the dominant subspace in low-rank optimizers such as GaLore "freezes" during pretraining (cosine overlap between consecutive subspaces approaches 1), trapping weight updates within a fixed low-rank subspace. The authors propose SARA (Sampling-based Adaptive Rank Allocation), which constructs subspaces by sampling singular vectors according to singular value weights, provides convergence guarantees, and reduces the performance gap between low-rank optimizers and full-rank Adam by up to 46%.
Broken Tokens: Your Language Model Can Secretly Handle Non-Canonical Tokenization: This paper reveals that LLMs can secretly handle non-canonical tokenizations (e.g., splitting "Hello" into "He"+"llo" instead of the canonical whole-word token)—even when the input token sequence differs from training, models exhibit surprising robustness. This capability stems from the property that sub-word embeddings in the embedding space can linearly combine to approximate whole-word embeddings.
Conformal Risk Training: End-to-End Optimization of Conformal Risk Control: This paper extends Conformal Risk Control (CRC) from expected loss to the generalized Optimized Certainty-Equivalent (OCE) risk measure (encompassing tail risks such as CVaR), and proposes conformal risk training—an end-to-end approach that differentiates through the conformal risk control procedure during training, achieving provable risk guarantees while significantly improving average-case performance.
Deep Compositional Phase Diffusion for Long Motion Sequence Generation: This paper proposes the Compositional Phase Diffusion framework, which employs SPDM and TPDM to handle semantic alignment and transition continuity, respectively, within the frequency-domain phase space established by ACT-PAE. The framework enables long-range compositional motion sequence generation and achieves state-of-the-art performance on BABEL-TEACH.
Differentiable Hierarchical Visual Tokenization: This paper proposes an end-to-end differentiable hierarchical visual tokenizer that adaptively partitions images into tokens at pixel-level granularity. It leverages information criteria for hierarchical model selection, serves as a drop-in replacement for the fixed patch tokenization in ViT, and additionally supports raster-to-vector conversion.
Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction: This paper disaggregates language model performance on subject-verb agreement tasks by experimental condition, revealing multi-phase training dynamics obscured by aggregate metrics: models first learn frequency biases, then local context sensitivity, and finally develop general grammatical rules — a process involving multiple "hidden breakthroughs" rather than simple monotonic improvement.
Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?: By defining the IsSameObject predicate and designing quadratic probes, this work demonstrates that large-scale pretrained ViTs — particularly DINO and CLIP — naturally develop object binding capabilities. This signal is encoded in a low-dimensional subspace and actively guides the attention mechanism, challenging the cognitive science community's view that ViTs lack binding ability.
Efficient Pre-Training of LLMs via Topology-Aware Communication Alignment on More Than 9600 GPUs: This paper proposes Arnold, a scheduling system that aligns the communication patterns of LLM training (DP/PP groups) with the physical network topology of data centers. In simulation, Arnold reduces the maximum communication group span by 1.67×, and achieves a 10.6% end-to-end throughput improvement in production-scale training on 9,600+ GPUs.
Enhancing Training Data Attribution with Representational Optimization: This paper proposes AirRep (Attentive Influence Ranking Representation), a representation learning-based training data attribution method that employs a trainable encoder and attention pooling mechanism. AirRep achieves attribution accuracy on par with or superior to state-of-the-art gradient-based methods while being approximately 80× faster at inference.
Final-Model-Only Data Attribution with a Unifying View of Gradient-Based Methods: This paper explicitly formulates the "Final-Model-Only" (FiMO) setting for training data attribution (TDA), reframes the problem from measuring contribution to measuring sensitivity, proposes further training as the gold standard, and provides a unified derivation showing that various gradient-based methods (Grad-Dot, influence functions, TRAK, DataInf, etc.) are all approximations of further training at different orders.
Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking: Using grokking (delayed generalization) as a causal probe, this paper demonstrates that relative flatness is a (potentially) necessary condition for generalization, whereas neural collapse, despite frequently co-occurring with generalization, is not necessary — it is merely one pathway toward flatness.
Gemstones: A Model Suite for Multi-Faceted Scaling Laws: This work releases the Gemstones model suite — an open-source collection of over 4,000 checkpoints spanning 50M–2B parameters and diverse width-depth ratios. Through systematic experimentation, the paper demonstrates that scaling laws are highly sensitive to design choices such as model selection, learning rate scheduling, and cooldown strategies, and proposes a convex-hull-based fitting method to improve scaling law stability under sparse sampling.
Generalization Bounds for Rank-sparse Neural Networks: This paper establishes generalization bounds that exploit the approximate low-rank structure of neural network weight matrices. When the Schatten $p$ quasi-norm is small, the sample complexity reduces to $\widetilde{O}(WrL^2)$, where $W$, $L$, and $r$ denote the width, depth, and rank of the weight matrices, respectively.
Global Minimizers of Sigmoid Contrastive Loss: This work provides the first rigorous characterization of the global minimizer geometry of the Sigmoid contrastive loss (SigLIP) with trainable temperature and bias in the practically relevant regime $N \gg d$. It introduces a novel combinatorial object called the $(m, b_\text{rel})$-Constellation, and uses it to explain retrieval success, the modality gap phenomenon, and to propose an explicit relative bias parameterization that improves training dynamics.
Gradient-Weight Alignment as a Train-Time Proxy for Generalization in Classification Tasks: This paper proposes Gradient-Weight Alignment (GWA), which quantifies the directional consistency (cosine similarity) between the gradient of each training sample and the model weights. During training, GWA accurately predicts generalization performance, identifies the optimal early stopping point, and localizes influential training samples—all without requiring a validation set.
How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models?: By introducing a "domain-restricted pre-training + OOD testing" evaluation framework, this paper reveals that stateful architectures such as Mamba and RWKV suffer from degraded base capabilities. It identifies the key design principle of "arbitrary selection over the full sequence" (full-sequence visibility + real relation calculation + non-uniform distribution), and validates this principle using a minimalist Top-1 Element/Chunk Selection architecture that recovers base capabilities to near-Transformer levels.
Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale: Through systematic analysis of over 1,400 language model checkpoints evaluated on 110K+ tokens, spanning Transformer/Mamba/RWKV architectures, 14M–12B parameter scales, and two training datasets, this work demonstrates that all autoregressive language models exhibit highly consistent behavioral phases during pre-training: predicted probabilities sequentially overfit to n-gram probabilities of increasing order. Three simple heuristics (word frequency, n-gram probability, and semantic similarity) account for up to 98% of behavioral variance.
Learning in Compact Spaces with Approximately Normalized Transformer: This paper proposes anGPT (Approximately Normalized GPT), which exploits the concentration of vector norms in high-dimensional spaces to replace per-sample exact normalization with simple scalar multiplication. The method achieves 40% convergence speedup over GPT+ (with QK-norm) while eliminating weight decay and learning rate warmup, incurring only 3% runtime overhead.
Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models: This paper reveals that LLMs learn spurious correlations between syntactic templates (PoS n-grams) and domains, leading to sharp performance drops in cross-domain settings. Furthermore, this correlation can be exploited to bypass safety refusal mechanisms, reducing the refusal rate from 40% to 2.5% on OLMo-2.
Learning to Flow from Generative Pretext Tasks for Neural Architecture Encoding: This paper proposes FGP (Flow-based Generative Pre-training), which trains an encoder to reconstruct a flow surrogate — a lightweight representation of architectural information flow — enabling encoders of arbitrary structure to capture information flow without specialized asynchronous message-passing designs. FGP achieves up to 106% improvement in Precision@1% on performance prediction.
Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models: This paper proposes the Residual Alignment Model (RAM), which formalizes the LLM alignment process as importance sampling and decomposes a large model into a frozen Proposal Module and a trainable lightweight Residual Aligner. Using fewer than 1/8 of the parameters, RAM achieves alignment performance comparable to or exceeding full-parameter SFT/DPO, while also resolving the first-token latency problem.
Memory Mosaics at Scale: Memory Mosaics v2 scales associative memory networks to 10B parameters trained on 1T tokens, substantially outperforming Transformers of the same scale, and even Transformers trained on 8T tokens, on new-task learning and in-context learning.
Mouse-Guided Gaze: Semi-Supervised Learning of Intention-Aware Representations for Reading Detection: This paper proposes a semi-supervised framework that uses mouse trajectories as weak supervision signals to pretrain gaze representations, followed by fine-tuning on labeled data to distinguish reading from scanning behavior. At inference time, only gaze signals are used, enabling hands-free assistive reading detection.
Nemotron-CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training: NVIDIA proposes the CLIMB framework, which automatically discovers optimal pre-training data mixture ratios through embedding-based clustering and iterative bootstrapped search. On a 1B-scale model, CLIMB outperforms Llama-3.2-1B by 2.0%. The authors also release the 1.2T-token ClimbLab corpus and the 400B-token ClimbMix high-quality dataset.
Neural Collapse under Gradient Flow on Shallow ReLU Networks for Orthogonally Separable Data: This paper provides the first provable convergence guarantee that gradient flow (GF) on two-layer ReLU networks with small initialization converges to a Neural Collapse (NC) solution on orthogonally separable data, revealing the critical role of GF's implicit bias—early neuron alignment followed by asymptotic maximum-margin bias—in driving the emergence of NC.
One Prompt Fits All: Universal Graph Adaptation for Pretrained Models: This paper theoretically proves that representation-level graph prompts are essentially equivalent to linear probes, and on this basis proposes UniPrompt—an input-level method based on a learnable kNN topological prompt graph. By fusing the prompt graph with the original graph via a bootstrapping strategy, UniPrompt consistently outperforms existing graph prompt learning methods on both in-domain and cross-domain few-shot node classification.
Optimal Online Change Detection via Random Fourier Features: This paper proposes the Online RFF-MMD algorithm, which approximates the MMD statistic via random Fourier features and embeds it into a sequential testing framework over a binary grid. The method achieves online nonparametric change detection without requiring training data or window size parameters, with $\mathcal{O}(r \log n)$ time and space complexity, and establishes minimax optimality of the detection delay.
Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training: This paper proposes a set of power-law scaling relations for weight decay $\lambda$ and batch size $B$ in LLM pre-training. By introducing the concept of an AdamW timescale $\tau$, it unifies hyperparameter scaling relationships, enabling accurate prediction of optimal hyperparameters prior to large-scale training.
Predict Training Data Quality via Its Geometry in Metric Space: This paper proposes a training data diversity metric based on Persistent Homology (PH), demonstrating that geometric and topological structural features of data can effectively predict model performance, outperforming traditional entropy-based metrics such as Vendi Score.
PRESCRIBE: Predicting Single-Cell Responses with Bayesian Estimation: PRESCRIBE is a framework that jointly models epistemic uncertainty (model unfamiliarity with inputs) and aleatoric uncertainty (inherent randomness of biological systems) in single-cell perturbation prediction via multivariate deep evidential regression. It generates a pseudo E-distance as a unified uncertainty proxy; filtering unreliable predictions based on this metric yields accuracy improvements exceeding 3%.
Quantifying Task-Relevant Representational Similarity Using Decision Variable Correlation: This paper proposes Decision Variable Correlation (DVC), a novel metric for quantifying trial-by-trial consistency between two neural representations on classification tasks. The authors find that higher ImageNet accuracy in deep networks is associated with lower DVC relative to monkey V4/IT, and that neither adversarial training nor large-scale dataset pretraining closes this gap.
Retrospective In-Context Learning for Temporal Credit Assignment with Large Language Models: This paper proposes RICL (Retrospective In-Context Learning), which leverages the pre-trained knowledge of LLMs to convert sparse environmental feedback into dense advantage function signals, achieving up to 100× higher sample efficiency than conventional Monte Carlo methods. Building on this, the paper further introduces RICOL, an online learning framework.
Retrospective In-Context Learning for Temporal Credit Assignment with Large Language Models: This paper proposes RICL (Retrospective In-Context Learning), which estimates the advantage function by comparing the log-probability difference of an LLM policy before and after an in-context update. This approach converts sparse environment feedback into dense training signals, enabling efficient temporal credit assignment, and reaches convergence performance comparable to traditional RL methods on BabyAI tasks with substantially higher sample efficiency.
Scalable Fingerprinting of Large Language Models: This paper proposes Perinucleus sampling to generate scalable LLM fingerprints, enabling the embedding of 24,576 fingerprints in Llama-3.1-8B—two orders of magnitude more than existing methods—without degrading model capability. Theoretical and empirical analyses demonstrate that large-scale fingerprinting is essential for defending against collusion attacks.
Scaling Embedding Layers in Language Models: This paper proposes Scone, a method that learns contextualized embeddings for high-frequency n-grams using a separate Transformer model, and offloads these embeddings to main memory/SSD at inference time. This enables a new scaling paradigm in which additional compute is consumed during training without increasing accelerator resource usage at inference. A 1B-parameter Scone model surpasses a 1.9B baseline.
Superposition Yields Robust Neural Scaling: This paper identifies representational superposition as the core driver of neural scaling laws: in the strong-superposition regime, loss universally scales inversely with model dimension ($L \propto 1/m$), independent of the specific form of the data frequency distribution—consistent with empirical scaling behavior in real LLMs.
The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation: This paper systematically dissects the internal mechanisms of LLMs in in-context retrieval augmented QA using the AttnLRP attribution method. Three functionally specialized attention head types are identified — Task heads (middle layers, parsing instructions/questions), Retrieval heads (later layers, verbatim copying of contextual answers), and Parametric heads (encoding parametric knowledge) — and their functions are validated via Function Vector injection and source-tracking probes, achieving ROC AUC ≥94% on Llama-3.1/Mistral/Gemma.
The Curse of Depth in Large Language Models: This paper identifies the root cause of deep-layer degradation in Pre-LN Transformers—exponential growth of output variance causing deep layers to collapse into identity mappings—and proposes a parameter-free LayerNorm Scaling (LNS) strategy that multiplies the LayerNorm output by $1/\sqrt{\ell}$, compressing variance growth from exponential to polynomial. LNS consistently improves perplexity by 5–8% across scales from 130M to 7B parameters.
Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training: From the geometric perspective of the river-valley loss landscape, this paper analyzes why the Schedule-Free (SF) optimizer can continuously track the optimal solution during language model pre-training without requiring learning rate decay or weight averaging. It further reveals that SF implicitly performs weight averaging, and proposes an improved SF-AdamW that decouples the momentum and averaging window parameters.
Understanding and Enhancing Mask-Based Pretraining towards Universal Representations: This paper employs high-dimensional linear regression theory to precisely characterize the effect of masking ratio on test risk in mask-based pretraining via a bias-variance decomposition, revealing that the optimal masking ratio depends on both the downstream task and model size. Building on this theory, the paper proposes R2MAE (Random Ratio MAE), which consistently outperforms fixed masking ratios across vision, language, DNA, and single-cell modeling benchmarks.
Vocabulary Customization for Efficient Domain-Specific LLM Deployment: This paper proposes a BPE tokenizer expansion algorithm that guarantees monotonically non-increasing encoding length, appending domain-frequent tokens to the Llama 3.1 vocabulary (+30K tokens). In an e-commerce setting, the approach shortens input sequences by 20% and improves inference throughput by 20–30%. After 10K steps of continual training, model quality is fully preserved, and in approximately 98% of cases the model actively generates the newly added tokens.
ZEUS: Zero-shot Embeddings for Unsupervised Separation of Tabular Data: ZEUS is the first zero-shot clustering method for tabular data. By pretraining a Transformer encoder on synthetic datasets, it learns generalizable representations that enable high-quality clustering of new datasets in a single forward pass, requiring no additional training or hyperparameter tuning.
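
The LayerNorm Scaling (LNS) rule described in the entry for The Curse of Depth in Large Language Models can be illustrated with a minimal sketch. It assumes a plain LayerNorm without learnable gain or bias and a 1-based layer index; the function names `layer_norm` and `layer_norm_scaled` are hypothetical and are not taken from the paper's code.

```python
import math


def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over one feature vector (no learnable gain/bias; an assumption)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def layer_norm_scaled(x, layer_index, eps=1e-5):
    """LayerNorm output multiplied by 1/sqrt(layer_index), with layers counted from 1."""
    if layer_index < 1:
        raise ValueError("layer_index is 1-based and must be >= 1")
    scale = 1.0 / math.sqrt(layer_index)
    return [scale * v for v in layer_norm(x, eps)]


if __name__ == "__main__":
    features = [1.0, 2.0, 3.0, 4.0]
    print(layer_norm_scaled(features, layer_index=1))  # same as plain LayerNorm
    print(layer_norm_scaled(features, layer_index=4))  # every entry halved
```

Under these assumptions, the normalized output at layer $\ell$ is shrunk by $1/\sqrt{\ell}$, so its variance is divided by $\ell$; the first layer is left unchanged.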

🎵 Audio & Speech

A Controllable Examination for Long-Context Language Models: This paper proposes LongBioBench, which uses synthetically generated fictional biographies as both needles and haystacks to construct a long-context LLM evaluation framework satisfying three core principles: seamless context, controllable settings, and reliable evaluation. Evaluating 18 models, the benchmark reveals that current LCLMs exhibit substantial deficiencies in reasoning and trustworthiness despite adequate retrieval performance.
A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings: This paper introduces TiALD (Tigrinya Abusive Language Detection), the first large-scale multi-task benchmark dataset for the low-resource Tigrinya language. It comprises 13,717 YouTube comments annotated jointly across three tasks—abusive language detection, sentiment analysis, and topic classification—and demonstrates that a compact fine-tuned model (TiRoBERTa, 125M parameters) consistently outperforms frontier LLMs such as GPT-4o and Claude Sonnet 3.7 across all tasks.
A TRIANGLE Enables Multimodal Alignment Beyond Cosine Similarity: TRIANGLE proposes using the area of the triangle formed by three modal embedding vectors in high-dimensional space as a similarity measure, replacing traditional pairwise cosine similarity to achieve joint alignment of video, audio, and text modalities. The method surpasses state-of-the-art by up to 9 Recall@1 points on video-text retrieval and related tasks.
Accelerate Creation of Product Claims Using Generative AI: This paper develops the Claim Advisor platform, leveraging LLM in-context learning and LoRA fine-tuning to accelerate the search, generation, refinement, and ranking of product claims for consumer goods. By emulating the MaxDiff research methodology, a fine-tuned Phi-3 14B model outperforms GPT-4o on claim ranking using only 1 in-context example versus GPT-4o's 100, and after three iterative rounds, 100% of generated claims achieve a "highly appealing" rating.
AdaptDel: Adaptable Deletion Rate Randomized Smoothing for Certified Robustness: AdaptDel extends the fixed deletion rate used in randomized smoothing for discrete sequences to an adaptable deletion rate that varies according to input properties such as sequence length. The paper provides a theoretical soundness proof for certification under variable rates, and experiments on NLP sequence classification tasks demonstrate improvements in certified region cardinality of up to 30 orders of magnitude.
Associative Syntax and Maximal Repetitions Reveal Context-Dependent Complexity in Fruit Bat Communication: This paper proposes an unsupervised approach for inferring discrete units, grammar types, and temporal structure from fruit bat vocalizations, and introduces Maximal Repetitions (MRs) to animal communication research for the first time, finding that communicative complexity is significantly higher in conflict contexts than in affiliative ones.
AudSemThinker: Enhancing Audio-Language Models through Reasoning over Semantics of Sound: AudSemThinker introduces a structured semantic reasoning framework for audio-language models by defining 9 categories of sound semantic descriptors (who/what/how/when/where, etc.). Built on Qwen2.5-Omni-7B and trained via SFT + GRPO (with verifiable rewards and length constraints), the model produces three-stage outputs in the format <think><semantic_elements><answer> and achieves 66.70% on the MMAU benchmark, surpassing Audio-Reasoner (61.71%) and Qwen2.5-Omni (65.60%).
Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents: Meta proposes WAGIBench, a multimodal goal inference benchmark for assistive wearable agents, comprising 3,477 egocentric recordings (29 hours) from 348 participants across four modalities — visual, audio, digital, and longitudinal. Human accuracy reaches 93% versus the best VLM at 84% (MCQ); under generative evaluation, models produce relevant goals only 55% of the time, exposing a substantial gap between current VLMs and real-world wearable deployment.
BNMusic: Blending Environmental Noises into Personalized Music: This paper proposes BNMusic, a two-stage framework that blends environmental noises into personalized generated music. Stage 1 generates rhythm-aligned music via mel-spectrogram outpainting and inpainting; Stage 2 adaptively amplifies the music signal based on auditory masking theory to reduce noise perception. The approach requires no additional training and significantly outperforms baselines on EPIC-SOUNDS and ESC-50.
Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation: This paper proposes RecBench, a comprehensive evaluation framework that systematically compares 17 LLMs against 10 conventional DLRMs across 5 domain-specific datasets. Results show that LLM-based recommenders achieve up to 5% AUC improvement on CTR tasks and up to 170% NDCG@10 improvement on sequential recommendation, yet incur 10–1000× slower inference. Conventional DLRMs augmented with LLM semantic embeddings (LLM-for-RS) attain approximately 95% of LLM performance at 20× higher throughput, making this paradigm the most industrially viable solution at present.
Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models: Data-Juicer 2.0 is a cloud-scale multimodal data processing system for foundation models, featuring 150+ operators spanning text, image, video, and audio. It supports adaptive distributed execution (Ray/MaxCompute), efficiently processes TB-scale data on 10,000+ CPU cores, and has been widely adopted in products such as Alibaba Cloud PAI.
DeepASA: An Object-Oriented Multi-Purpose Network for Auditory Scene Analysis: This paper proposes DeepASA, an object-oriented multi-task unified architecture that simultaneously performs multi-channel source separation (MIMO), dereverberation, sound event detection (SED), audio classification, and direction-of-arrival estimation (DoAE) within a single model via object-oriented processing and a chain-of-inference mechanism, achieving state-of-the-art performance on multiple spatial audio benchmarks.
E-BATS: Efficient Backpropagation-Free Test-Time Adaptation for Speech Foundation Models: This paper proposes E-BATS, the first backpropagation-free test-time adaptation framework for speech foundation models. Through lightweight prompt adaptation, multi-scale loss functions, and a test-time EMA mechanism, E-BATS achieves 2.0×–6.4× GPU memory savings while maintaining competitive accuracy.
E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis: E2E-VGuard is a proactive defense framework against voice cloning threats in LLM-based end-to-end speech synthesis. It disrupts timbre via encoder ensemble perturbation, interferes with pronunciation recognition via adversarial attacks on ASR systems, and ensures imperceptibility through a psychoacoustic model. Effectiveness is validated across 19 TTS models and 7 ASR systems.
Echoes of Humanity: Exploring the Perceived Humanness of AI Music: Through a randomized controlled crossover trial (RCCT) and mixed-methods content analysis, this paper systematically investigates listeners' ability to distinguish AI-generated music (AIM) from human-created music. Results show that listeners perform at chance level under random pairing, but accuracy rises significantly to 66% under similar pairing. Vocal, sound, and technical cues are identified as the key factors enabling successful discrimination.
Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space: This paper proposes SLED, which encodes speech waveforms into sequences of continuous latent representations and performs autoregressive modeling in the continuous space via an energy distance objective. This avoids the information loss from discretization and the complex hierarchical architectures required by RVQ, while enabling efficient zero-shot and streaming speech synthesis.
Ethics Statements in AI Music Papers: The Effective and the Ineffective: This paper systematically reviews how ethics statements are used in AI music research papers, finds that the vast majority are not used effectively, and proposes actionable recommendations for both conferences and researchers.
EuroSpeech: A Multilingual Speech Corpus: This paper presents a scalable, open-source pipeline for automatically constructing the EuroSpeech dataset from recordings of 22 European parliaments — yielding 61K hours of high-quality speech-text aligned data across 22 languages, with 19 languages exceeding 1K hours. Fine-tuning Whisper on this data reduces average WER by 41.8%.
From Generation to Attribution: Music AI Agent Architectures for the Post-Streaming Era: This paper proposes a content-based Music AI Agent architecture that decomposes music into fine-grained Block components and constructs an Attribution Layer, embedding copyright attribution directly into the AI music creation pipeline to establish a fair AI media platform for the post-streaming era.
Generating Physically Sound Designs from Text and a Set of Physical Constraints: This paper proposes TIDES, a framework that combines the visual guidance of pretrained text-image models (CLIP) with a differentiable finite-element physics simulator. By jointly optimizing a visual similarity loss and a structural compliance loss, TIDES generates load-bearing structural designs that satisfy both engineering performance requirements and text-specified visual characteristics, starting from text descriptions and physical constraints. The method is validated through 3D-printed three-point bending experiments.
Inductive Transfer Learning for Graph-Based Recommenders: This paper proposes NBF-Rec, a graph-based recommendation model built upon the Neural Bellman-Ford Network, which supports inductive transfer learning across datasets with completely disjoint users and items, enabling zero-shot cross-domain recommendation and lightweight fine-tuning adaptation.
Instance-Specific Test-Time Training for Speech Editing in the Wild: This paper proposes an instance-specific test-time training (TTT) method for in-the-wild speech editing. Prior to inference, the model is fine-tuned at the instance level using direct supervision from the acoustic features of unedited regions, and indirect supervision over edited regions via duration constraints and a phoneme prediction auxiliary loss. This approach effectively mitigates bandwidth discontinuity at editing boundaries, supports precise speaking-rate control through mask length adjustment, and surpasses existing systems on both objective and subjective metrics on an in-the-wild benchmark.
Latent Space Factorization in LoRA: This paper proposes FVAE-LoRA, which incorporates a VAE with dual latent spaces into the LoRA framework. Through a novel ELBO objective, it explicitly factorizes task-relevant features ($\mathbf{z}_1$) from residual information ($\mathbf{z}_2$), consistently outperforming standard LoRA across text, image, and audio tasks.
LeVo: High-Quality Song Generation with Multi-Preference Alignment: This paper proposes LeVo, a song generation framework that employs a language model to jointly model mixed tokens and dual-track tokens, thereby reconciling vocal-accompaniment harmony with audio quality. It further introduces a DPO-based multi-preference alignment method to enhance musicality and instruction-following capability.
LeVo: High-Quality Song Generation with Multi-Preference Alignment: LeVo proposes a language-model-based song generation framework that simultaneously optimizes vocal–accompaniment harmony and audio quality by predicting mixed tokens and dual-track tokens in parallel, and introduces a DPO-based multi-preference alignment method to enhance musicality and instruction-following ability. LeVo comprehensively outperforms all academic baselines and approaches the performance of industrial systems.
LUMIA: A Handheld Vision-to-Music System for Real-Time, Embodied Composition: This paper presents Lumia — a handheld camera-shaped device that analyzes captured frames via GPT-4 Vision to generate structured prompts, which are then fed to Stable Audio to synthesize loopable music segments, enabling a real-time, embodied improvisation workflow from visual input to music.
MEGADance: Mixture-of-Experts Architecture for Genre-Aware 3D Dance Generation: This paper proposes MEGADance, the first music-driven 3D dance generation method based on a Mixture-of-Experts (MoE) architecture. It decouples choreographic consistency into "dance universality" (Universal Expert) and "style specificity" (Specialized Expert), combined with FSQ quantization and a Mamba-Transformer hybrid backbone, achieving state-of-the-art dance quality and strong style controllability.
Merlin L48 Spectrogram Dataset: This paper introduces the L48 dataset, a fine-grained spectrogram multi-label classification benchmark derived from real-world bird recordings that naturally exhibits the Single Positive Multi-Label (SPML) setting. The dataset exposes critical shortcomings of existing SPML methods under realistic conditions, and the paper proposes an intra-recording consistency regularization scheme to improve performance.
Mixed Monotonicity Reachability Analysis of Neural ODE: A Trade-Off Between Tightness and Efficiency: This paper applies continuous-time mixed monotonicity techniques to the reachability analysis of Neural ODEs. By embedding Neural ODE dynamics into a mixed monotone system, it exploits the geometric simplicity of interval boxes to achieve efficient over-approximation, providing a controllable trade-off between tightness and computational efficiency.
MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition: MoME integrates sparse MoE into the Matryoshka representation learning framework for LLM-based audio-visual speech recognition. Through a shared router, it enables cross-granularity knowledge transfer, supporting elastic inference at multiple compression rates under a single set of model weights, while achieving state-of-the-art performance on AVSR/ASR/VSR.
Multi-head Temporal Latent Attention: MTLA extends MLA's low-rank latent compression along the feature dimension by introducing a hyper-network that dynamically merges temporally adjacent KV vectors, achieving dual-axis compression of the KV cache across both feature and temporal dimensions. Combined with a stride-aware causal mask to ensure training–inference consistency, MTLA achieves 4.29× speedup and 6.58× memory reduction on speech translation and related tasks, with quality on par with or slightly exceeding standard MHA.
Node-Based Editing for Multimodal Generation of Text, Audio, Image, and Video: This paper proposes a node-graph-based story editing system that allows creators to iteratively generate, edit, and compare multimodal content (text, audio, image, and video) through natural language and node-level operations, supporting both linear and branching narrative structures.
Perceptually Aligning Representations of Music via Noise-Augmented Autoencoders: This paper demonstrates that applying noise augmentation to latent variables during autoencoder training, combined with a perceptual loss, induces a "perceptual hierarchy" in the encoding space — the most perceptually salient musical features (e.g., pitch) are encoded in the coarsest latent structures, while secondary features (e.g., timbral details) are encoded in finer structures. This alignment improves music surprisal estimation under latent diffusion decoding and enhances EEG brain response prediction.
Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers: This work systematically compares language model architectures via controlled synthetic pretraining tasks, and finds that the Canon layer—a lightweight component performing weighted summation over neighboring tokens—significantly enhances core capabilities including reasoning depth (2–4×), reasoning breadth, and knowledge capacity, enabling NoPE to match RoPE and GLA to rival Mamba2/GDN.
Resounding Acoustic Fields with Reciprocity: Leveraging the reciprocity principle of acoustic wave propagation, this paper proposes Versa (ELE data augmentation + SSL self-supervised learning), which generates physically valid virtual training samples by swapping emitter and receiver roles, substantially improving acoustic field estimation performance under sparse emitter configurations.
SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers: This paper proposes SAND-Math, a fully automated synthetic mathematics question generation pipeline that requires no seed dataset. It employs Difficulty Hiking to systematically increase problem difficulty; augmenting the LIMO baseline with as few as 500 problems yields a 4.39 pp improvement on AIME25.
Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI Models in Sound Localization: Through six controlled audio-visual conditions and human psychophysical experiments, this work systematically reveals that existing AI sound source localization (SSL) models suffer from severe visual bias—degrading to near-random performance under audio-visual conflict—and proposes EchoPin, a neuroscience-inspired model combining HRTF filtering, ERB cochleagram representation, and stereo audio. EchoPin substantially outperforms prior methods on the newly constructed AudioCOCO dataset and, without any human behavioral supervision, exhibits a human-like horizontal-over-vertical localization accuracy asymmetry.
Segment-Factorized Full-Song Generation on Symbolic Piano Music: This paper proposes the Segmented Full-Song (SFS) model, which decomposes a song into segments and autoregressively generates each segment by attending selectively to structurally relevant context. SFS achieves faster and more structurally coherent full-song piano generation compared to existing methods, while supporting interactive human-AI co-creation.
Sensorium Arc: AI Agent System for Oceanic Data Exploration and Interactive Eco-Art: This paper presents Sensorium Arc, a multimodal interactive AI agent system that personifies the ocean as a poetic "narrator" figure. Leveraging a multi-agent RAG architecture, the system integrates NASA ocean science data with eco-aesthetic texts, enabling users to explore complex marine environmental data through natural conversation while dynamically generating scientific visualizations and artistic audiovisual feedback—realizing a paradigm shift from "passive data observation" to "active ecological dialogue."
SHAP Meets Tensor Networks: Provably Tractable Explanations with Parallelism: This paper presents the first provably exact SHAP computation framework for Tensor Networks (TNs), proves that SHAP under the Tensor Train (TT) structure is parallelizable in polylogarithmic time (NC² complexity), and reveals via reductions that width rather than depth is the fundamental bottleneck for SHAP computation in binarized neural networks.
SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation: This paper proposes SimulMEGA, a framework combining prefix training with a Mixture-of-Experts (MoE) refinement module to achieve unsupervised read/write policy learning. A 500M-parameter model achieves <7% BLEU degradation at 1.5-second latency across simultaneous speech translation in 6 languages, and extends to streaming TTS.
Slimmable NAM: Neural Amp Models with Adjustable Runtime Computational Cost: This paper applies the Slimmable Networks paradigm to the Neural Amp Modeler (NAM) by randomly pruning WaveNet layer widths during training, enabling dynamic adjustment of network size at inference time without additional training cost, allowing musicians to balance audio fidelity and computational expense in real time.
Sound Logical Explanations for Mean Aggregation Graph Neural Networks: For GNNs using mean aggregation (MAGNN, i.e., mean-GNNs with non-negative weights), this work precisely characterizes the class of monotone logical rules that can serve as sound explanations, constructs a restricted fragment of first-order logic to explain arbitrary MAGNN predictions, and empirically demonstrates that restricting to non-negative weights does not significantly hurt performance while enabling effective extraction of sound rules.
Target Speaker Extraction Through Comparing Noisy Positive and Negative Audio Enrollments: This paper proposes a novel enrollment strategy that encodes target speaker characteristics by contrasting noisy positive enrollments (segments where the target speaker is active) against negative enrollments (segments where the target speaker is silent), achieving state-of-the-art performance on monaural noisy-enrollment target speaker extraction with SI-SNRi exceeding the previous best method by over 2.1 dB.
AVRobustBench: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time: This paper proposes AVRobustBench, the first benchmark that systematically evaluates the test-time robustness of audio-visual models under co-occurring correlated dual-modality corruptions, comprising 4 datasets × 75 corruption types, and introduces AV2C, a TTA method based on low-entropy sample selection.
The Impact of Scaling Training Data on Adversarial Robustness: A systematic evaluation of 36 state-of-the-art vision models under 6 categories of black-box attacks reveals that attack success rate (ASR) decreases logarithmically with training data volume and model scale; however, data quality and model scale are more critical than data volume alone.
Unifying Symbolic Music Arrangement: Track-Aware Reconstruction and Structured Tokenization: This paper proposes a unified symbolic music arrangement framework that employs a segment-level self-supervised reconstruction objective (decoupling content from instrument style) and a novel multi-track tokenization scheme REMI-z, enabling a single pretrained model to handle diverse arrangement tasks—including orchestral arrangement, piano reduction, and drum arrangement—while surpassing task-specific state-of-the-art methods across all three benchmarks.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction: VITA-1.5 proposes a carefully designed three-stage progressive training strategy that incrementally integrates visual and speech capabilities into an LLM, achieving end-to-end vision-speech real-time interaction without relying on standalone ASR/TTS modules, while attaining state-of-the-art performance among open-source models on image, video, and speech benchmarks.
WhAM: Towards A Translative Model of Sperm Whale Vocalization: This paper proposes WhAM (Whale Acoustics Model), the first Transformer-based generative model for sperm whale codas, achieving acoustic translation, synthetic generation, and downstream classification through fine-tuning of VampNet.
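
The triangle-area similarity described in the TRIANGLE entry of this section can be sketched as follows. This is a minimal illustration, assuming the triangle's vertices are the endpoints of the three embedding vectors and that a smaller area means tighter joint alignment; the function name `triangle_area` and the toy vectors are hypothetical, and any normalization or loss weighting used in the paper is not reproduced.

```python
import math


def triangle_area(a, b, c):
    """Area of the triangle whose vertices are the endpoints of vectors a, b, c."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("all three embeddings must have the same dimension")
    # Edge vectors from vertex a to vertices b and c.
    u = [bi - ai for ai, bi in zip(a, b)]
    v = [ci - ai for ai, ci in zip(a, c)]
    uu = sum(x * x for x in u)
    vv = sum(x * x for x in v)
    uv = sum(x * y for x, y in zip(u, v))
    # Gram determinant = squared area of the parallelogram spanned by u and v.
    gram = max(uu * vv - uv * uv, 0.0)  # clamp tiny negative rounding errors
    return 0.5 * math.sqrt(gram)


if __name__ == "__main__":
    video = [1.0, 0.0, 0.0]
    audio = [0.0, 1.0, 0.0]
    text = [0.0, 0.0, 1.0]
    print(triangle_area(video, audio, text))   # three different embeddings
    print(triangle_area(video, video, video))  # identical embeddings give zero area
```

The Gram-determinant form works in any embedding dimension, which is why it is used here instead of a 3D cross product.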

✂️ Segmentation

Alligat0R: Pre-Training through Covisibility Segmentation for Relative Camera Pose Regression: This paper replaces CroCo's cross-view completion with covisibility segmentation as a stereo vision pre-training task, predicting a per-pixel label of "co-visible / occluded / out-of-view". The approach significantly outperforms CroCo in low-overlap scenarios and achieves a first-place overall success rate of 60.3% on the RUBIK benchmark.
ARGenSeg: Image Segmentation with Autoregressive Image Generation Model: This paper proposes ARGenSeg — the first unified MLLM framework that leverages the autoregressive image generation paradigm for image segmentation. The model directly outputs visual tokens decoded by a VQ-VAE into segmentation masks, requiring no additional segmentation head. A next-scale prediction parallel generation strategy enables 4× inference speedup, and the method surpasses state-of-the-art on RefCOCO/+/g with significantly less training data.
Attention (as Discrete-Time Markov) Chains: This work reinterprets the softmax-normalized attention matrix as the transition probability matrix of a Discrete-Time Markov Chain (DTMC), and proposes Multi-Bounce Attention and TokenRank (stationary distribution, analogous to PageRank) to capture indirect attention paths and global token importance. The approach achieves 94.29% mAP on ImageNet segmentation and enhances image generation quality in Self-Attention Guidance.
ConnectomeBench: Can LLMs Proofread the Connectome?: This paper introduces ConnectomeBench, the first standardized benchmark for evaluating multimodal LLMs on three key connectomics proofreading tasks: segment identification, split error correction, and merge error detection. o4-mini achieves 85% on the split correction multiple-choice task, yet merge error detection remains significantly below human expert performance.
COS3D: Collaborative Open-Vocabulary 3D Segmentation: This paper proposes COS3D — a collaborative prompt-segmentation framework that constructs a collaborative field comprising an instance field and a language field. During training, the language field is built via instance-to-language feature mapping; during inference, language-to-instance adaptive prompt refinement generates precise segmentation results. COS3D substantially outperforms existing methods on two mainstream benchmarks.
Diffusion-Driven Two-Stage Active Learning for Low-Budget Semantic Segmentation: A two-stage active learning pipeline (coverage → uncertainty) is proposed, leveraging multi-scale features from pretrained diffusion models to achieve efficient semantic segmentation under extremely low annotation budgets.
Exploring Structural Degradation in Dense Representations for Self-supervised Learning: This paper identifies and systematically investigates the Self-supervised Dense Degradation (SDD) phenomenon — where longer training improves classification yet hurts dense task performance — and proposes the DSE metric along with DSE-guided model selection and regularization strategies, achieving an average mIoU improvement of 3.0%.
Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning: By introducing convolutional decoding normalization (replacing hard semi-autoregressive chunking) and rule-based rejective fine-tuning (R2FT), the proposed method achieves generation quality at 128 inference steps comparable to 512+ steps, reaching state-of-the-art performance among diffusion language models (DLMs).
FAST: Foreground-aware Diffusion with Accelerated Sampling Trajectory for Segmentation-oriented Anomaly Synthesis: FAST introduces explicit mechanisms to preserve anomaly regions throughout the diffusion trajectory: AIAS compresses the multi-step reverse process of discrete diffusion into a small number of coarse-to-fine analytical updates, while FARM reconstructs and reinjects anomaly foregrounds at each step, yielding a method that is both fast and better suited for generating training data for downstream anomaly segmentation models.
FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning: FineRS is a two-stage MLLM reinforcement learning framework comprising Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR), coupled via a locate-informed retrospective reward. Evaluated on the newly constructed FineRS-4k UAV high-resolution dataset, it achieves reasoning and segmentation of ultra-small objects with a gIoU of 55.1% (surpassing Seg-Zero† by 8.5%) while simultaneously supporting VQA (MVQA 83.3%).
GTPBD: A Fine-Grained Global Terraced Parcel and Boundary Dataset: This paper introduces GTPBD, the first fine-grained global terraced parcel and boundary dataset, comprising 47,537 high-resolution images (0.5–0.7 m) with over 200,000 manually annotated parcels. It provides three-level labels supporting four tasks—semantic segmentation, edge detection, agricultural parcel extraction, and unsupervised domain adaptation—and presents comprehensive benchmarks across 20 methods.
HAODiff: Human-Aware One-Step Diffusion via Dual-Prompt Guidance: This paper proposes HAODiff, a human-aware one-step diffusion model that generates adaptive positive–negative prompt pairs via a three-branch Dual-Prompt Guidance (DPG) module. Combined with an explicit Human Motion Blur (HMB) degradation pipeline and Classifier-Free Guidance (CFG), HAODiff substantially outperforms existing state-of-the-art methods on human image restoration tasks.
HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios: This paper introduces the Referring Human Action Segmentation (RHAS) task—localizing a specific individual in multi-person videos via textual descriptions and performing frame-level action segmentation. The authors construct the RHAS133 dataset comprising 133 movies, 137 action categories, and 33 hours of video, and propose HopaDIFF, a holistic-partial aware Fourier-conditioned diffusion framework that substantially outperforms existing baselines across multiple evaluation settings.
HumanCrafter: Synergizing Generalizable Human Reconstruction and Semantic 3D Segmentation: HumanCrafter is the first feed-forward framework that unifies single-image 3D human reconstruction with body-part semantic segmentation. A human geometry prior-guided Transformer aggregates multi-view features, while DINOv2 self-supervised semantic priors construct a 3D feature field. The method surpasses existing SOTA methods in both 3D reconstruction and segmentation on 2K2K and THuman2.1.
InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition: This paper introduces a new task — Instruction-oriented Counting, Detection, and Segmentation (InstructCDS) — along with the EarthInstruct remote sensing benchmark covering three settings (open-vocabulary, open-ended, and open-subcategory). It proposes InstructSAM, a training-free framework that uses an LVLM to parse instructions and predict counts, SAM2 to generate mask proposals, and CLIP to compute similarities. A Binary Integer Programming (BIP) formulation then performs optimal mask-label assignment under counting constraints, achieving near-constant inference time while outperforming task-specific baselines.
Interpreting ResNet-based CLIP via Neuron-Attention Decomposition: This paper proposes a neuron-attention decomposition method to interpret CLIP-ResNet by decomposing model outputs into pairwise contribution paths of neurons and attention pooling heads. The resulting neuron-head pairs are shown to admit rank-1 approximations, exhibit sparsity, and capture sub-concepts. The method is applied to training-free semantic segmentation (mIoU 26.2% on PASCAL Context, surpassing MaskCLIP by 15%) and dataset distribution shift monitoring.
LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation: LangHOPS is the first open-vocabulary object-part instance segmentation framework based on a multimodal large language model (MLLM). It establishes object-part hierarchical relationships in language space and leverages the knowledge and reasoning capabilities of MLLMs to bridge multi-granularity concepts. It achieves 56.9% AP on PartImageNet, surpassing the previous SOTA by 5.5%, and outperforms it by 4.8% in cross-dataset settings.
Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks: This paper presents Mars-Bench — the first comprehensive benchmark for Mars science tasks, encompassing 20 datasets across three task types (classification, segmentation, and object detection). It systematically evaluates ImageNet-pretrained models, Earth observation foundation models, and vision-language models on Martian data, revealing significant gaps in current general-purpose models and calling for the development of Mars-specific foundation models.
Mechanistic Interpretability of RNNs Emulating Hidden Markov Models: A vanilla RNN is trained to reproduce the emission statistics of an HMM; reverse engineering then reveals the mechanism by which the RNN implements discrete stochastic state transitions: noise-driven orbital dynamics combined with rapid transitions triggered by "kick neurons." The underlying principle is self-induced stochastic resonance (SISR), and this dynamical motif can be composed and reused to emulate more complex discrete latent structures.
Mechanistic Interpretability of RNNs Emulating Hidden Markov Models: By training RNNs to emulate the emission statistics of HMMs, then reverse-engineering the learned solutions, this work reveals how RNNs exploit noise-driven orbital dynamics, structured connectivity (noise-integrating populations + kick neurons), and self-induced stochastic resonance to implement discrete stochastic state transitions.
MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans: This paper introduces MultiHuman-Testbench, the first systematic benchmark for evaluating multi-human image generation. It comprises 1,800 test samples paired with 5,550 face images and a suite of multi-dimensional evaluation metrics, including Hungarian-matching-based identity similarity. The paper also proposes Regional Isolation and Implicit Region Assignment techniques to enhance existing methods without additional training.
Novel Class Discovery for Point Cloud Segmentation via Joint Learning of Causal Representation and Reasoning: This paper is the first to introduce causal learning into 3D point cloud novel class discovery (3D-NCD). By leveraging a Structural Causal Model (SCM) to analyze confounders in base classes and causal relationships between base and novel classes, it proposes Causal Representation Prototype learning (CRP, which eliminates confounders via an adversarial network) and graph-based causal reasoning (GCN-based pseudo-label generation), achieving state-of-the-art results on SemanticKITTI and SemanticPOSS.
OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation: OmniSegmentor constructs a large-scale ImageNeXt dataset encompassing 5 visual modalities (1.2M samples), proposes an efficient pretraining strategy that randomly selects one supplementary modality to align with RGB per iteration, and establishes the first flexible multi-modal pretrain-finetune pipeline, achieving state-of-the-art results on 6 multi-modal semantic segmentation benchmarks.
Panoptic Captioning: An Equivalence Bridge for Image and Text: This paper proposes the novel task of Panoptic Captioning, which pursues a minimum text equivalence of images—defining a comprehensive structured description along five dimensions (entity semantic tags, locations via bounding boxes, attributes, relations, and global state)—and introduces the PancapEngine data engine and PancapChain decoupled multi-stage method. A 13B model trained under this framework surpasses InternVL-2.5-78B and GPT-4o.
PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding: This paper presents PartNeXt, a fine-grained hierarchical part annotation dataset comprising 23,519 high-quality textured 3D models across 50 categories. Two benchmarks are established—category-agnostic part segmentation and 3D part question answering—revealing significant deficiencies of current methods in fine-grained part understanding.
PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding: This paper introduces the Partonomy part-level segmentation benchmark (862 part labels / 534 object labels) and the Plum model, which replaces the [SEG] token with BIO span tagging and incorporates a mask feedback loop. The study reveals that state-of-the-art segmentation LMMs achieve only 5.9% gIoU on part understanding; Plum achieves significant improvements by avoiding distribution shift and leveraging historical predictions.
Re-coding for Uncertainties: Edge-awareness Semantic Concordance for Resilient Event-RGB Segmentation: This paper proposes the Edge-awareness Semantic Concordance (ESC) framework, which leverages semantic edges as an intermediate bridge between heterogeneous Event and RGB modalities. Through discrete latent space modeling via an edge dictionary, ESC achieves cross-modal feature alignment and uncertainty optimization, surpassing the state of the art by 2.55% mIoU under extreme conditions.
HCLFuse: Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws: HCLFuse performs modality alignment via the information bottleneck principle and optimal transport theory, combining a Variational Bottleneck Encoder (VBE) with a physics-guided conditional diffusion model. Three physical constraints—heat conduction, structure preservation, and physical consistency—are injected into the diffusion process. On the MSRS dataset, the gradient metric AG improves by 69.87% and spatial frequency SF improves by 39.41%.
Robust Ego-Exo Correspondence with Long-Term Memory: This paper proposes LM-EEC, a SAM 2-based cross-view video object segmentation framework for ego-exo correspondence. It introduces a Memory-View MoE (MV-MoE) module to adaptively fuse memory features with cross-view features, coupled with a dual memory bank compression strategy for retaining long-term information. LM-EEC substantially outperforms existing methods on the EgoExo4D benchmark (Ego2Exo IoU: 54.98 vs. 38.26).
Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention: This paper proposes CERES, a framework that addresses language bias and visual confusion in egocentric referring video object segmentation (Ego-RVOS) via dual-modal causal intervention — language backdoor adjustment to eliminate dataset statistical bias, and depth-guided visual frontdoor adjustment to construct causal mediators — achieving SOTA on VISOR/VOST/VSCOS.
RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing: This paper proposes RoMA — the first self-supervised autoregressive pre-training framework based on the Mamba architecture for remote sensing. By introducing an adaptive rotation encoding strategy and a multi-scale token prediction mechanism, RoMA addresses the challenges of orientation diversity and extreme scale variation inherent in remote sensing imagery, while empirically validating that Mamba follows data and parameter scaling laws in the remote sensing domain.
SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation: SaFiRe is a framework that simulates the human two-stage "saccade-fixation" cognitive process, leveraging Mamba's scan-then-update mechanism to achieve linear-complexity multi-round refinement for referring image segmentation under ambiguous expressions.
SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning: SAM-R1 proposes an end-to-end reasoning segmentation framework that, for the first time, incorporates SAM as a reward provider within the reinforcement learning training loop. Combined with a tiered IoU accuracy reward, asymmetric clipping, and token-level loss normalization in an improved GRPO algorithm, the method achieves a gIoU of 60.2% on the ReasonSeg zero-shot benchmark—surpassing Seg-Zero and other approaches—using only 3K training samples.
SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation: SANSA reveals that SAM2, despite being pre-trained in a class-agnostic manner, implicitly encodes rich semantic structure in its features. By inserting lightweight AdaptFormer adapters into the last two layers of a frozen SAM2 Image Encoder, the method redirects the Memory Attention mechanism from visual-similarity matching to semantic-similarity matching. This unified architecture achieves state-of-the-art performance on few-shot segmentation while being more than 3× faster and 4–5× smaller in parameter count than competing approaches.
Seg-VAR: Image Segmentation with Visual Autoregressive Modeling: Seg-VAR reformulates image segmentation as a conditional autoregressive mask generation problem. By introducing seglat (a latent representation of segmentation masks) and spatial-aware color mapping, it encodes segmentation masks into discrete tokens processable by a VAR model. Seg-VAR comprehensively outperforms discriminative methods such as Mask2Former and generative methods such as GSS across semantic, instance, and panoptic segmentation tasks on COCO, Cityscapes, and ADE20K.
Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers: Through systematic analysis of the joint attention mechanism in Multimodal Diffusion Transformers (MM-DiT), this paper identifies specific layers ("semantic localization expert layers") that inherently possess high-quality semantic segmentation capability, and proposes a lightweight fine-tuning method, MAGNET, that simultaneously improves both segmentation and generation performance.
Self-supervised Synthetic Pretraining for Inference of Stellar Mass Embedded in Dense Gas: This paper proposes a "synthetic data-driven self-supervised pretraining" paradigm: one million synthetic fractal images are first generated via the Flame algorithm to pretrain a ViT-L/16 encoder using the DINOv2 framework; the frozen encoder is then transferred directly to an extremely limited set of magnetohydrodynamic (MHD) star-formation simulation data, achieving stellar mass prediction via kNN regression ($R^2 = 0.81$) and zero-shot unsupervised semantic segmentation via PCA projection—slightly outperforming a fully supervised ResNet-18 baseline trained on the same data.
SRSR: Enhancing Semantic Accuracy in Real-World Image Super-Resolution with Spatially Re-Focused Text-Conditioning: SRSR proposes a training-free plug-and-play framework that addresses semantic hallucination caused by text guidance in diffusion-based super-resolution methods. It introduces two inference-time modules—Spatially Re-focused Cross-Attention (SRCA) and Spatially Targeted CFG (STCFG)—and comprehensively outperforms 7 SOTA baselines in both fidelity and perceptual quality.
STEAD: Robust Provably Secure Linguistic Steganography with Diffusion Language Model: This paper proposes STEAD, the first provably secure and robust linguistic steganography method based on diffusion language models (DLMs). It exploits the parallel denoising property of DLMs to identify "robust positions" for message embedding, and combines repetitive error-correcting codes with a neighborhood search strategy to resist token-level substitution, insertion, and deletion attacks.
STEP: A Unified Spiking Transformer Evaluation Platform for Fair and Reproducible Benchmarking: STEP is the first unified evaluation platform for Spiking Transformers (STs), supporting multi-task benchmarking (classification/segmentation/detection) and multiple backends (SpikingJelly/BrainCog/BrainPy). Through systematic ablation, it reveals that current STs rely heavily on convolutional frontends, that attention contributes minimally, and that temporal modeling capacity is insufficient. The platform further proposes a unified energy consumption analysis framework accounting for bit-width sparsity and memory access costs.
TabRAG: Improving Tabular Document Question Answering for Retrieval Augmented Generation via Structured Representations: This paper proposes TabRAG, a parsing-based RAG framework that decomposes documents into fine-grained components via layout segmentation, extracts tables into hierarchical structured representations using vision-language models, and integrates a self-generated in-context learning module to adapt to diverse table formats, achieving comprehensive improvements over existing parsing techniques on tabular document question answering.
Torch-Uncertainty: A Deep Learning Framework for Uncertainty Quantification: Torch-Uncertainty is the first unified, scalable, domain-agnostic, and evaluation-centric PyTorch/Lightning framework for uncertainty quantification (UQ), integrating 6 major UQ method families, 26 evaluation metrics, and 27 plug-and-play datasets across classification, segmentation, and regression tasks, along with comprehensive benchmark results.
Towards Robust Pseudo-Label Learning in Semantic Segmentation: An Encoding Perspective: This paper proposes ECOCSeg, which replaces one-hot encoding with Error-Correcting Output Codes (ECOC) to represent semantic categories. It decomposes an N-class classification problem into K binary sub-tasks, and couples bit-level pseudo-label denoising with customized optimization losses to substantially improve the robustness of pseudo-label learning in UDA and SSL semantic segmentation.
Towards Unsupervised Domain Bridging via Image Degradation in Semantic Segmentation: This paper proposes DiDA, which formalizes image degradation operations as the forward process of diffusion models to construct a continuous intermediate domain between the source and target domains. Combined with a semantic shift compensation mechanism, DiDA serves as a plug-and-play module that consistently improves existing UDA semantic segmentation methods.
UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning: UniPixel proposes the first end-to-end large multimodal model that unifies object referring and segmentation, leveraging a novel Object Memory Bank design to transform sparse visual prompts into dense object mask features and inject them into the reasoning process. The model achieves state-of-the-art performance on 10 benchmarks and introduces PixelQA, a new task requiring simultaneous referring, segmentation, and question answering.
Unveiling the Spatial-Temporal Effective Receptive Fields of Spiking Neural Networks: This paper proposes a Spatial-Temporal Effective Receptive Field (ST-ERF) analysis framework to diagnose the bottleneck of Transformer-based SNNs in visual long-sequence modeling—namely, the lack of a global receptive field—and accordingly designs two channel mixers, MLPixer and SRB, to enhance the global modeling capability of SNNs.
Vanish into Thin Air: Cross-prompt Universal Adversarial Attacks for SAM2: This paper proposes UAP-SAM2—the first cross-prompt universal adversarial attack against SAM2—which employs a dual semantic shift framework (intra-frame semantic confusion + inter-frame semantic inconsistency) to generate a single universal perturbation that causes segmentation targets to "vanish" across different videos, frames, and prompts.
Vision Transformers with Self-Distilled Registers: This paper proposes PH-Reg (Post Hoc Registers), an efficient self-distillation approach that retrofits register tokens into existing pretrained ViTs without labeled data or full retraining. By combining test-time augmentation-based teacher feature denoising with student self-distillation, PH-Reg effectively eliminates artifact tokens in ViT dense features, improving performance on segmentation and depth estimation.
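
As a rough illustration of the encoding idea in the ECOCSeg entry above (an N-class problem split into K binary sub-tasks), the sketch below builds a small codebook and decodes noisy bit predictions by nearest Hamming distance. The greedy codebook construction and the names `hamming`, `build_codebook`, and `decode` are assumptions made for illustration only; the paper's actual codebook design, bit-level denoising, and losses are not described in this note.

```python
from itertools import product


def hamming(a, b):
    """Number of positions where two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def build_codebook(num_classes, code_length):
    """Greedily pick one K-bit codeword per class (hypothetical scheme).

    Each new codeword maximizes its minimum Hamming distance to the
    codewords chosen so far. Exhaustive search, so only for small K.
    """
    candidates = list(product((0, 1), repeat=code_length))
    codebook = [candidates[0]]
    while len(codebook) < num_classes:
        best = max(
            (c for c in candidates if c not in codebook),
            key=lambda c: min(hamming(c, w) for w in codebook),
        )
        codebook.append(best)
    return codebook


def decode(bits, codebook):
    """Map K predicted bits to the class whose codeword is nearest."""
    return min(range(len(codebook)), key=lambda k: hamming(bits, codebook[k]))


if __name__ == "__main__":
    codebook = build_codebook(num_classes=4, code_length=6)
    for class_id, word in enumerate(codebook):
        noisy = list(word)
        noisy[0] ^= 1  # simulate one wrong binary prediction
        print(class_id, word, "->", decode(noisy, codebook))
```

With 4 classes and 6 bits, this greedy codebook keeps every pair of codewords at least 3 bits apart, so one wrong binary prediction per pixel still decodes to the intended class. A one-hot label has no such slack, which is the intuition behind using redundant codes for noisy pseudo-labels.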

🔄 Self-Supervised Learning

A Joint Learning Approach to Hardware Caching and Prefetching: This paper proposes a joint training framework that unifies hardware cache replacement and prefetching policies. By constructing shared feature representations via a joint encoder and contrastive learning, the framework breaks the performance bottleneck imposed by independently trained policies.
Adv-SSL: Adversarial Self-Supervised Representation Learning with Theoretical Guarantees: This paper proposes Adv-SSL, which rewrites the Frobenius norm of the covariance regularization term as a minimax dual form, eliminating the biased sample-level risk estimation present in methods such as Barlow Twins. The approach substantially improves downstream classification performance without incurring additional computational cost, and provides end-to-end theoretical convergence guarantees.
Angular Constraint Embedding via SpherePair Loss for Constrained Clustering: This paper proposes the SpherePair loss function, which performs pairwise constraint embedding learning in angular space (rather than Euclidean space), enabling a deep constrained clustering method that requires neither anchors nor prior knowledge of the number of clusters, while providing rigorous theoretical guarantees for determining optimal hyperparameters.
Asymptotic and Finite-Time Guarantees for Langevin-Based Temperature Annealing in InfoNCE: By modeling embedding evolution as Langevin dynamics on a compact Riemannian manifold, this paper proves that the convergence guarantees of classical simulated annealing extend to the temperature scheduling setting in contrastive learning: a sufficiently slow logarithmic inverse-temperature schedule guarantees probabilistic convergence to the globally optimal representation set, whereas faster schedules risk trapping the system in suboptimal minima.
BrainOmni: A Brain Foundation Model for Unified EEG and MEG Signals: This paper proposes BrainOmni—the first brain signal foundation model that unifies EEG and MEG—by discretizing heterogeneous brain signals into a unified token space via BrainTokenizer (incorporating a physical Sensor Encoder), followed by self-supervised masked prediction pretraining with a Criss-Cross Transformer. The model achieves an 11.7 percentage-point improvement on Alzheimer's disease detection and demonstrates zero-shot reconstruction generalization to completely unseen devices.
Connecting Jensen-Shannon and Kullback-Leibler Divergences: A New Bound for Representation Learning: This paper derives the optimal tight lower bound of KL divergence in terms of JS divergence, $\Xi(D_{\text{JS}}) \leq D_{\text{KL}}$, in the general case. It proves that training a discriminator by minimizing cross-entropy loss is equivalent to maximizing a guaranteed lower bound on mutual information, thereby providing the missing theoretical foundation for JSD-based discriminative representation learning methods. The tightness and practical utility of the bound are validated in MI estimation and the Information Bottleneck framework.
Continuous Subspace Optimization for Continual Learning (CoSO): This paper proposes CoSO, a framework that dynamically derives continuous subspaces from per-step gradient SVD (rather than LoRA's fixed subspace), combined with orthogonal projection onto historical task subspaces to prevent interference and Frequent Directions for efficient gradient information aggregation. CoSO achieves 78.19% final accuracy on ImageNet-R with 20 tasks, surpassing the best baseline by 2.77 percentage points.
Contrastive Consolidation of Top-Down Modulations Achieves Sparsely Supervised Continual Learning: This paper proposes Task-Modulated Contrastive Learning (TMCL), inspired by top-down modulation in the neocortex. In continual learning, sparse label information (as little as 1% labels) is integrated via affine modulation, and contrastive learning is then used to consolidate the modulation information into feedforward weights. TMCL surpasses both unsupervised and supervised baselines on class-incremental learning and transfer learning benchmarks.
Contrastive Representations for Temporal Reasoning: This paper proposes CRTR (Contrastive Representations for Temporal Reasoning), which introduces intra-trajectory negative pairs by repeating trajectory IDs within training batches. This eliminates the reliance on static contextual features in standard temporal contrastive learning, enabling representations that reflect temporal structure. CRTR achieves, for the first time, search-free solving on combinatorial reasoning tasks such as the Rubik's Cube.
Curiosity-driven RL for Symbolic Equation Solving: This work combines curiosity-driven exploration mechanisms (RND, ICM, etc.) with a graph action space based on expression trees, enabling a PPO agent to solve nonlinear equations involving radicals, exponentials, and trigonometric functions — surpassing prior RL methods that were limited to linear equations.
DataRater: Meta-Learned Dataset Curation: This paper proposes DataRater, a meta-gradient-based data valuation framework that employs meta-learning to automatically score and filter low-quality training samples. It achieves up to 46.6% net compute savings across multiple pre-training datasets, and a DataRater trained on a 400M internal model generalizes directly to LLM training at scales ranging from 50M to 1B parameters.
Disentangling Hyperedges through the Lens of Category Theory: This work is the first to analyze hyperedge disentanglement through the lens of category theory. By deriving a naturality condition, it establishes a "factor representation consistency" criterion (aggregation-then-disentanglement vs. disentanglement-then-aggregation should yield consistent results), and proposes Natural-HNN, which comprehensively outperforms 14 baselines across 6 cancer subtype classification datasets (BRCA F1: 75.7% → 80.4%) while achieving 100% accuracy in capturing the functional context of genetic pathways.
Foundation Models for Scientific Discovery: From Paradigm Enhancement to Paradigm Transition: This paper proposes a three-stage framework (meta-scientific integration → hybrid human-AI co-creation → autonomous scientific discovery) to characterize how foundation models are driving a transition in scientific paradigms from tool-based enhancement toward paradigm-level transformation. It also provides a systematic survey of FM integration across the four classical scientific paradigms: experimental, theoretical, computational, and data-driven.
Hybrid Autoencoders for Tabular Data: Leveraging Model-Based Augmentation in Low-Label Settings: This paper proposes TANDEM (Tree-And-Neural Dual Encoder Model), a hybrid autoencoder architecture that jointly trains a neural network encoder and an Oblivious Soft Decision Tree (OSDT) encoder, and introduces a sample-level stochastic gating network as a learnable data augmentation mechanism. TANDEM achieves superior performance over strong baselines—including tree-based and deep learning methods—in low-label tabular data settings.
Implicit Modeling for Transferability Estimation of Vision Foundation Models: This paper proposes Implicit Transferability Modeling (ITM), a framework that encodes the transferability of model–task pairs via a latent variable $z$, and employs Divide-and-conquer Variational Approximation (DVA) to efficiently simulate embedding space evolution. On 10 downstream tasks with 10 diverse pre-trained models, the weighted Kendall $\tau_w$ improves from the previous state-of-the-art of 0.45 to 0.61.
Know Thyself by Knowing Others: Learning Neuron Identity from Population Context: This paper proposes NuCLR, a self-supervised framework that learns neuron-level representations enriched with population context via contrastive learning—pulling together different temporal windows of the same neuron and pushing apart different neurons within a population. NuCLR achieves new state-of-the-art performance on cell type and brain region decoding, and is the first to demonstrate cross-animal zero-shot generalization and data scaling laws in this domain.
Long-Tailed Recognition via Information-Preservable Two-Stage Learning: This paper proposes an information-preservable two-stage learning framework: Stage 1 employs Balanced Negative Sampling (BNS) to learn an effective and separable feature space via mutual information maximization; Stage 2 uses Information-Preservable DPP (IP-DPP) to sample the most informative examples in a mathematically principled manner to correct majority-biased decision boundaries. The method achieves state-of-the-art performance on multiple long-tailed benchmarks.
M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization: To address policy collapse and entropy collapse in self-supervised reinforcement learning for LLMs, this paper proposes M-GRPO, a momentum-anchored GRPO framework combined with an IQR-based low-entropy trajectory filtering method, achieving stable training and state-of-the-art performance.
M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization: To address the pervasive "policy collapse" problem in self-supervised reinforcement learning from verifiable rewards (SS-RLVR) during extended training, this paper proposes M-GRPO: a framework that employs a momentum model to provide stable pseudo-label targets alongside IQR-based low-entropy trajectory filtering to prevent entropy collapse. Training Qwen3-4B-Base on unlabeled MATH data, the final checkpoint directly surpasses the manually selected best checkpoint of SRT, achieving +2.92% on AIME24 and +5.05% on GPQA.
Manifolds and Modules: How Function Develops in a Neural Foundation Model: This work opens the "black box" of a state-of-the-art neural activity foundation model (FNN) from a computational neuroscience perspective. By constructing decoding and encoding manifolds, the study reveals that each processing module (encoder, recurrent, readout) exhibits qualitatively distinct representational structures, and identifies critical discrepancies between the model and the biological visual system.
Memory-Integrated Reconfigurable Adapters: A Unified Framework for Settings with Multiple Tasks: MIRA embeds Hopfield-style associative memory modules into each layer of a ViT, storing and retrieving LoRA adapter weights as key-value pairs. Through a two-stage training procedure (adaptation + consolidation), it simultaneously addresses domain generalization (DG), class-incremental learning (CIL), and domain-incremental learning (DIL) within a single unified architecture, significantly outperforming task-specific methods across multiple benchmarks.
Minimal Semantic Sufficiency Meets Unsupervised Domain Generalization: MS-UDG operates without class or domain labels, decomposing representations into semantic and variation components via an Information Disentanglement Module (IDM). Coupled with a Semantic Representation Optimization Module (SROM) that simultaneously maximizes semantic information and minimizes variation interference, the method achieves 72.89% accuracy on PACS (+1.5% vs. CycleMAE). Theoretical analysis proves that minimally sufficient semantic representations minimize the downstream Bayes error rate.
Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models: This paper presents the first systematic study of design principles for synthetic priors, identifying diversity, distinctiveness, and real-data alignment as critical attributes. Based on these findings, the authors propose Mitra — a tabular foundation model trained on a carefully selected mixture of synthetic priors — which consistently outperforms TabPFNv2 and TabICL on both classification and regression benchmarks.
One Filters All: A Generalist Filter for State Estimation: This paper proposes LLM-Filter, which reprograms a large language model (LLM) as a generalist state estimator. Through a System-as-Prompt (SaP) mechanism, the frozen LLM achieves zero-shot generalization to unseen dynamical systems, surpassing state-of-the-art learning-based filters.
SEAL: Semantic-Aware Hierarchical Learning for Generalized Category Discovery: This paper proposes SEAL, a framework that leverages naturally occurring semantic hierarchies (rather than manually constructed abstract hierarchies) to guide generalized category discovery. Through hierarchically semantic-guided soft contrastive learning and a cross-granularity consistency module, SEAL achieves state-of-the-art performance on fine-grained benchmarks.
Soft Task-Aware Routing of Experts for Equivariant Representation Learning: This paper proposes STAR (Soft Task-Aware Routing), which employs a MoE routing mechanism to coordinate shared and task-specific information between invariant and equivariant representation learning objectives, reducing redundant feature learning and improving downstream transfer performance.
STaRFormer: Semi-Supervised Task-Informed Representation Learning via Dynamic Attention-Based Regional Masking: STaRFormer employs Dynamic Attention-based Regional Masking (DAReM) to identify task-critical regions and apply masking perturbations, coupled with intra-batch and intra-class semi-supervised contrastive learning to embed task information into latent representations. The method comprehensively outperforms state-of-the-art baselines across 56 datasets spanning non-stationary, irregularly sampled, classification, anomaly detection, and regression settings.
T-REGS: Minimum Spanning Tree Regularization for Self-Supervised Learning: This paper proposes T-REGS — a self-supervised learning regularization framework based on maximizing the length of the minimum spanning tree (MST). The authors theoretically prove that the method simultaneously prevents dimensional collapse and promotes uniform distribution of representations on compact Riemannian manifolds, with empirical validation on standard JE-SSL benchmarks.
TabArena: A Living Benchmark for Machine Learning on Tabular Data: This paper introduces TabArena, the first continuously maintained "living" benchmark for tabular machine learning. From 1,053 candidate datasets, 51 are curated and 16 models are evaluated through large-scale experiments (~25 million model training runs). Key findings: under post-hoc ensembling, deep learning models match or surpass GBDTs; tabular foundation models excel on small datasets; and cross-model ensembles further advance the state of the art.
TabSTAR: A Tabular Foundation Model for Tabular Data with Text Fields: TabSTAR is a foundation model designed specifically for tabular data with text fields. It achieves target-aware text representations through end-to-end optimization with an unfrozen text encoder (e5-small-v2), injects target semantics via target-aware tokens, and enables cross-dataset transfer learning through a dataset-parameter-free architecture. After pre-training on 350 datasets, TabSTAR surpasses CatBoost-Tuned (4h tuning) on 12 out of 14 classification datasets and outperforms TabPFN-v2 on 8 out of 11 datasets.
The Complexity of Finding Local Optima in Contrastive Learning: This paper proves that finding local optima in contrastive learning is computationally hard: the discrete triplet maximization problem is PLS-hard (even when $d=1$), and continuous triplet loss minimization is CLS-hard, implying that (under standard assumptions) no polynomial-time algorithm exists for finding local optima.
Towards Reliable and Holistic Visual In-Context Learning Prompt Selection: This paper proposes RH-Partial2Global, which for the first time employs Spearman rank correlation tests to demonstrate that the "similarity-first hypothesis" in VICL is statistically significant yet exhibits extremely weak correlation strength ($\bar{\rho} \approx 0.03\text{-}0.05$). By constructing reliable candidate sets via Jackknife conformal prediction and achieving comprehensive uniform pairwise preference sampling through covering designs, the method consistently outperforms state-of-the-art approaches across three visual tasks: segmentation, detection, and colorization.
TRIDENT: Tri-Modal Molecular Representation Learning with Taxonomic Annotations and Structural Relationships: TRIDENT is a tri-modal molecular representation learning framework that introduces Hierarchical Taxonomic Annotations (HTA) as a third modality. It combines a volumetric contrastive loss for global tri-modal alignment with a functional group–text local alignment module, dynamically balancing the two objectives via a momentum mechanism. The framework achieves state-of-the-art performance across 18 molecular property prediction tasks.
Uncertainty-Guided Model Selection for Tabular Foundation Models in Biomolecule Efficacy Prediction: This paper proposes OligoICP, a method that leverages the interquartile range (IQR) of TabPFN's predicted distributions as an unlabeled model selection heuristic, achieving superior performance over both specialized SOTA models and naive ensembles on siRNA knockdown efficiency prediction.
Understanding Ice Crystal Habit Diversity with Self-Supervised Learning: This paper presents the first application of self-supervised learning (SSL) to latent representation learning for ice crystal images. By pre-training a ViT on a large-scale cloud particle image dataset, the method learns continuous latent representations of ice crystal habits and quantifies habit diversity using the vMF concentration parameter, achieving a state-of-the-art classification accuracy of 84.39% with a 30× reduction in computational cost.
You Can Trust Your Clustering Model: A Parameter-free Self-Boosting Plug-in for Deep Clustering: This paper proposes DCBoost, a plug-and-play module requiring no additional hyperparameters, which selects high-confidence samples via adaptive k-NN and leverages reliable local structural information to guide global feature space optimization, substantially improving the performance of existing deep clustering models.
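
As a rough illustration of the quantity behind the T-REGS entry above, the sketch below computes the total edge length of a Euclidean minimum spanning tree over a small set of embedding vectors. It only shows why a longer MST corresponds to more spread-out representations. The function name `mst_length` and the toy point sets are assumptions made for this example, and the paper's actual loss and training setup are not reproduced here.

```python
import math


def mst_length(points):
    """Total Euclidean edge length of the minimum spanning tree (Prim's algorithm)."""
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge from the current tree to each point
    best[0] = 0.0          # start the tree at point 0 at no cost
    total = 0.0
    for _ in range(n):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        # Update attachment costs for the remaining points.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total


# Hypothetical 2-D "embeddings": one nearly collapsed set, one spread-out set.
collapsed = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.01), (0.01, 0.01)]
spread = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

print(mst_length(collapsed), mst_length(spread))
```

The spread-out set yields a much longer tree than the nearly collapsed one, which is the intuition behind maximizing MST length as a regularizer.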

⚡ LLM Efficiency¶

3-Model Speculative Decoding (PyramidSD): PyramidSD introduces a three-tier pyramid decoding architecture by inserting an intermediate "qualifier" model between the draft model ($M_D$) and target model ($M_T$) in standard speculative decoding. The method exploits the natural entropy gradient across model scales within a model family to hierarchically filter tokens, and employs a fuzzy acceptance criterion to relax the matching threshold, achieving up to 1.91× speedup (reaching 124 tok/s on an RTX 4090).
A Unified Framework for Establishing the Universal Approximation of Transformer-Type Architectures: This paper establishes a unified theoretical framework for proving the universal approximation property (UAP) of diverse Transformer architectures. The framework rests on two core conditions, nonlinear affine invariance of the feed-forward layer and token distinguishability of the attention layer, and leverages an analyticity assumption to reduce the latter to verification on two-sample cases only. It covers a wide range of practical architectures, including softmax, RBF kernel, Performer, BigBird, and Linformer.
Advancing Expert Specialization for Better MoE: By jointly optimizing an orthogonality loss (reducing projection overlap among experts) and a variance loss (increasing routing score diversity), the proposed method reduces expert overlap by 45% and improves routing variance by 150% without modifying the MoE architecture, achieving an average gain of 23.79% across 11 benchmarks while fully preserving load balance.
Approximately Aligned Decoding: This paper proposes Approximately Aligned Decoding (AprAD), a method for constrained generation in LLMs that leverages the prefix-selection algorithm from speculative decoding. Upon encountering a constraint violation, AprAD neither reverts only one token (as in constrained generation, which causes extreme probability amplification) nor resamples entirely from scratch (as in ASAp, which incurs prohibitive computational cost). Instead, it intelligently selects a rollback position via speculative sampling, achieving a favorable trade-off between output distribution distortion and computational efficiency.
Constant Bit-Size Transformers Are Turing Complete: This paper provides the first proof that a Transformer with constant bit-size precision and a fixed number of parameters — permitting only context window growth — is Turing complete. It establishes the exact complexity equivalence WINDOW[s(n)] = SPACE[s(n)], demonstrating that expanding the context window, rather than model size, suffices for universal computation.
Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training: This paper proposes a branched training method to directly measure the critical batch size (CBS) empirically, finding that CBS grows rapidly in early training before plateauing and is independent of model scale. Based on this insight, a batch size warmup strategy is designed that achieves equivalent or superior training loss with 43% fewer gradient steps.
DISC: Dynamic Decomposition Improves LLM Inference Scaling: DISC proposes a dynamic decomposition algorithm that automatically and recursively adjusts the granularity of reasoning steps at inference time based on the z-score (normalized maximum of sampled rewards) at each step, decomposing difficult steps more finely while taking larger strides over easy ones. It can be plugged into greedy search, beam search, and MCTS, achieving higher pass@k with smaller token budgets on APPS, MATH, and LiveCodeBench.
Dynamics of Spontaneous Topic Changes in Next Token Prediction with Self-Attention: This paper investigates, both theoretically and empirically, the dynamics of spontaneous topic changes in self-attention models. For a single-layer self-attention model, it establishes three results: (1) training on mixed topics preserves the token priority ordering of the original topic; (2) topic changes occur only when the number of low-priority tokens exceeds that of high-priority tokens; and (3) longer inputs and more ambiguous topics do not increase the probability of topic change — contrary to human cognition.
Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving: This paper proposes PORT, the first training-free online LLM routing algorithm. PORT estimates query features via approximate nearest neighbor search (ANNS) and performs a one-shot optimization of dual variables as routing weights on a small set of initial queries. Under a limited token budget, PORT achieves near-offline-optimal routing performance with a $1-o(1)$ competitive ratio, delivering on average 3.55× performance improvement, 1.85× cost efficiency, and 4.25× throughput over baselines.
From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers: This paper provides rigorous theoretical analysis demonstrating that the diversity of pretraining data—characterized by the max-sum ratio—determines whether a single-layer Transformer learns a generalizable induction head or a non-OOD-generalizing positional shortcut, and derives a closed-form optimal pretraining distribution that promotes induction head formation.
Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access: This paper proposes Hierarchical Sparse Attention (HSA) and the RAMba architecture, which enable Mamba to perform efficient long-range random access through a two-stage token-to-chunk relevance learning mechanism and hardware-aligned kernel design. Pretrained on only 4K context, RAMba achieves 100% accuracy on 64M passkey retrieval.
Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM: This paper proposes Hierarchical Balance Packing (HBP), which addresses attention computation imbalance and communication waste in mixed long/short-context SFT through multi-level packing groups, balanced batching, adaptive sequence parallelism, and stable loss normalization. HBP achieves a 2.4× training speedup on DeepSeek-V2 (236B) without performance degradation.
Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search: NVIDIA proposes the PostNAS pipeline — starting from a pretrained full-attention model, freezing MLP weights, and applying a four-step search (full-attention layer placement → linear attention block selection → novel JetBlock design → hardware-aware hyperparameter search) to yield the hybrid Jet-Nemotron architecture. The 2B model surpasses Qwen3-1.7B on MMLU-Pro while achieving 47× higher generation throughput.
L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models: L-MTP introduces a leap mechanism into multi-token prediction (MTP) by predicting tokens at non-adjacent positions (e.g., positions 1, 3, 5, 7 instead of 1, 2, 3, 4). A "looking backward" decoding strategy reuses prior predictions to fill the gaps, achieving a 22% inference speedup on 3B–12B models while maintaining or improving task performance.
Let the Experts Speak: Improving Survival Prediction & Calibration via Mixture-of-Experts Heads: Three discrete-time deep Mixture-of-Experts (MoE) survival analysis architectures are proposed, among which Personalized MoE achieves superior clustering, calibration, and predictive accuracy simultaneously by allowing each expert to generate a patient-specific event distribution.
Linear Attention for Efficient Bidirectional Sequence Modeling: This paper proposes Lion, a framework that, for the first time, systematically extends linear Transformers to bidirectional sequence modeling. It unifies three equivalent representations—full linear attention, bidirectional RNN, and chunkwise parallel—and achieves training speeds up to 10× faster than SSM-based approaches while delivering performance comparable to softmax Transformers on image classification and MLM tasks.
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs: This paper proposes Dynamic Hierarchical Sparse Attention (DHSA), a hierarchical framework that replaces dense attention with sparse attention via adaptive chunk segmentation, chunk-level similarity prediction, and upsampling to token level — without retraining the base model. On Gemma2/3, DHSA achieves accuracy on par with dense attention while reducing prefill latency by 20–60% and peak memory by 35%.
LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?: LooGLE v2 is a long-dependency reasoning benchmark spanning four real-world domains—legal, financial, gaming, and code—with context lengths ranging from 16K to 2M tokens. It comprises 10 domain-specific task types and 1,934 QA instances. Evaluation of 10 LLMs reveals that the strongest model, GPT-4.1, achieves only 59.2%, exposing fundamental deficiencies of current LLMs in real-world long-dependency scenarios.
MoESD: Revealing the Potential of Speculative Decoding to Accelerate Sparse MoE: This work challenges the prevailing belief that speculative decoding (SD) is ineffective for MoE models. Through theoretical analysis and experiments, it demonstrates that MoE models benefit more from SD than dense models at medium batch sizes. The paper introduces target efficiency as a system-level metric to quantify acceleration bottlenecks, constructs a reliable performance prediction model, and achieves up to 2.29× speedup on Qwen2-57B-A14B.
Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures: Mozart is an algorithm-hardware co-design framework that achieves over 1.9× training speedup on three MoE-LLMs via expert clustering and allocation, fine-grained streaming scheduling, and a 3.5D chiplet architecture (NoP-Tree + hierarchical memory).
OmniDraft: A Cross-Vocabulary Online Adaptive Drafter for On-Device Speculative Decoding: This paper proposes the OmniDraft framework, which achieves cross-vocabulary speculative decoding via an online n-gram cache, aligns the draft model with the target model through a hybrid distillation loss, and dynamically adjusts the proposal length with an adaptive drafting head. A single lightweight Llama-68M model thereby provides speculative decoding acceleration (1.5–2×) for diverse target models such as Vicuna-7B, Qwen2-7B, and Llama3-8B.
On the Expressive Power of Mixture-of-Experts for Structured Complex Tasks: This paper presents the first systematic analysis of MoE expressive power on structured complex tasks. It proves that shallow MoE can overcome the curse of dimensionality on low-dimensional manifolds (approximation rate governed by intrinsic dimension $d$ rather than ambient dimension $D$), and that deep MoE with $E$ experts × $L$ layers can efficiently approximate piecewise functions with $E^L$ pieces through hierarchical composition—far exceeding the naive upper bound of $LE$.
Scale-invariant Attention: Drawing inspiration from the scale invariance of natural images, this paper proposes a position-dependent affine transformation on attention logits—comprising a multiplicative scaling and an additive shift—such that the total attention weight and sparsity over any token range satisfy scale invariance. This enables zero-shot generalization from short-context training to long-context inference (4k→64k) with a single hyperparameter $\tau$.
Silent Tokens, Loud Effects: Padding in LLMs: This paper systematically investigates the effects of padding tokens on LLMs when they are not properly masked. The study finds that even a small number of padding tokens can drift hidden-layer representations, degrade generation quality, and unpredictably shift social biases. Critically, 128 padding tokens raise the harmful prompt attack success rate of Llama-3.1-8B from 8% to 77.5%, effectively constituting a jailbreak.
SkyLadder: Better and Faster Pretraining via Context Window Scheduling: SkyLadder, a progressive short-to-long context window scheduling strategy, achieves superior pretraining efficiency (22% training time saved) and improved model performance (+3.7%) under a fixed compute budget, challenging the prevailing belief that "longer context = better performance."
SPARTA Alignment: Collectively Aligning Multiple Language Models through Combat: Multiple LLMs form a "Spartan tribe" to engage in mutual competition and peer evaluation. Preference pairs are generated via reputation-weighted judgment aggregation, and all models are iteratively trained with DPO. The approach surpasses self-alignment baselines such as Self-Rewarding on 10 out of 12 tasks, with an average improvement of 7%.
Technical Debt in In-Context Learning: Diminishing Efficiency in Long Context: Drawing on the methodology of optimization software benchmarking, this work precisely quantifies the sample efficiency of ICL relative to the Bayes-optimal estimator via performance ratios. A clear dichotomy is identified: in the few-shot regime (≤15 demonstrations), efficiency is near-optimal (only ~10% overhead), whereas in the many-shot regime (>40 demonstrations) it degrades sharply (>45% overhead). Information-theoretic analysis establishes that this phenomenon stems from a non-decreasing excess risk that is irreducible—an intrinsic limitation of the ICL mechanism.
Tensor Product Attention Is All You Need: By decomposing Q/K/V into weighted sums of low-rank factors via contextual tensor products, TPA compresses the KV cache by 10–16×, while surpassing standard MHA/MQA/GQA/MLA on both validation loss and downstream task accuracy.
The Emergence of Sparse Attention: Impact of Data Distribution and Benefits of Repetition: This paper investigates the emergence mechanism of sparse attention through theoretical analysis and controlled experiments, revealing that the emergence time follows a power-law relationship with respect to sequence length and dimensionality, $T_\epsilon \propto \sqrt{d} \cdot T$. It further demonstrates that both in-context and cross-sample data repetition strategies accelerate emergence, offering a unified sparse attention perspective for understanding capability emergence in LLMs.
The PokeAgent Challenge: Competitive and Long-Context Learning at Scale: This paper introduces the PokéAgent Challenge, a large-scale dual-track AI benchmark built on Pokémon competitive battling and RPG speedrunning. Validated through the NeurIPS 2025 competition, it demonstrates that specialist RL methods substantially outperform general-purpose LLM approaches, and reveals that the capabilities measured by Pokémon battling are nearly orthogonal to those assessed by 49 existing LLM benchmarks.
Tiled Flash Linear Attention: More Efficient Linear RNN and xLSTM Kernels: This paper proposes TFLA (Tiled Flash Linear Attention), which achieves efficient linear RNN/mLSTM kernels through two-level sequence parallelism and tiling optimization, delivering significant wall-clock speedups over FlashAttention 3 and Mamba 2 (>2× in training vs. Mamba 2) while maintaining equivalent model accuracy.
UMoE: Unifying Attention and FFN with Shared Experts: By reformulating the multi-head attention mechanism, this work reveals that attention shares the same "two-layer matrix multiplication" structure as FFN layers. Based on this insight, UMoE is proposed as a unified architecture that employs identically designed experts for both attention and FFN layers with parameter sharing, outperforming existing FFN-MoE and Attention-MoE baselines on both Base (134M) and Large (1.1B) models.
Unmasking COVID-19 Vulnerability in Nigeria: Mapping Risks Beyond Urban Hotspots: This paper constructs a comprehensive COVID-19 vulnerability risk scoring system for Nigerian states, integrating four dimensions — population density, poverty, healthcare accessibility, and age risk — and visualizes hotspot regions via GIS mapping, providing a data-driven decision tool for public health resource allocation.
Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding: This paper proposes Yggdrasil, a latency-optimal speculative decoding system that achieves compiler-friendly dynamic drafting via the Equal-Growth Tree (EGT) structure, replaces the conventional AAL metric with a latency-aware optimization objective, and reduces CPU-GPU coordination overhead through a stage-based scheduling runtime, achieving up to 3.98× end-to-end speedup on A100/A40 GPUs.
ZeroS: Zero-Sum Linear Attention for Efficient Transformers: By removing the zeroth-order uniform term $1/t$ from softmax, ZeroS constructs a linear attention mechanism with zero-sum weights, breaking the limitation of convex combinations to purely additive mixing. This enables differential/contrastive operations within a single layer while maintaining $O(Nd^2)$ linear complexity, matching or surpassing standard softmax attention across multiple sequence modeling benchmarks.
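
To make the zero-sum idea in the ZeroS entry above concrete, here is a minimal sketch that subtracts the uniform term $1/t$ from ordinary softmax weights over $t$ scores. It only demonstrates that the resulting weights sum to zero and can be negative. The function names are assumptions for this example, and the linear-attention formulation of the paper itself is not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def zero_sum_weights(scores):
    """Softmax weights with the uniform 1/t term removed (illustrative only)."""
    t = len(scores)
    return [w - 1.0 / t for w in softmax(scores)]


scores = [2.0, 0.5, -1.0, 0.0]
weights = zero_sum_weights(scores)

# Softmax weights sum to 1, so removing 1/t from each of the t weights leaves a sum of 0.
print(weights, sum(weights))
```

Because some weights are positive and others negative, a weighted sum of values is no longer a convex combination, so it can express differences between tokens rather than only averages.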

🔍 Information Retrieval & RAG¶

Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering: This paper introduces the MMDocRAG benchmark (4,055 expert-annotated QA pairs) to systematically evaluate 60 VLMs/LLMs and 14 retrievers on quote selection and interleaved text-image answer generation in multimodal document RAG. Results reveal that the strongest model, GPT-4.1, achieves only 70.2% Quote Selection F1, while fine-tuning yields substantial performance gains.
Chain-of-Retrieval Augmented Generation (CoRAG): This paper proposes CoRAG, a framework that automatically generates intermediate retrieval chains (sub-query → sub-answer) via rejection sampling, fine-tunes an LLM to learn iterative retrieval and reasoning, and supports diverse test-time decoding strategies (greedy / Best-of-N / tree search) for flexible compute scaling. CoRAG achieves 26+ EM improvement on multi-hop QA and attains state-of-the-art on 9/10 tasks of the KILT benchmark.
Cooperative Retrieval-Augmented Generation for Question Answering: Mutual Information Exchange and Ranking by Contrasting Layers: CoopRAG is a framework that achieves bidirectional cooperation between the retriever and the LLM through query expansion, retriever layer-contrastive reranking, and reasoning chain completion. It surpasses HippoRAG2 by 5.3% on multi-hop QA and by 35.2% on single-hop QA.
Deep Research Brings Deeper Harm: This paper reveals critical safety vulnerabilities in Deep Research (DR) agents — even when the underlying LLM correctly refuses harmful queries, deploying it as a DR agent can still produce detailed, professional, and dangerous reports. Two targeted jailbreak methods, Plan Injection and Intent Hijack, are proposed alongside the DeepREJECT evaluation metric. Experiments on 6 LLMs demonstrate that DR agents systematically undermine alignment mechanisms.
DICE: Discrete Interpretable Comparative Evaluation with Probabilistic Scoring for RAG: This paper proposes the DICE framework, which achieves interpretable, robust, and efficient evaluation of RAG systems through a two-stage assessment pipeline (evidence-coupled deep analysis + probabilistic {A, B, Tie} scoring) combined with a Swiss-system tournament. On a Chinese financial QA dataset, DICE attains 85.7% agreement with human experts, substantially outperforming RAGAS (45.7%).
Generalized Contrastive Learning for Universal Multimodal Retrieval: This paper proposes Generalized Contrastive Learning (GCL), which performs contrastive learning over all 6 modality-pair combinations within a mini-batch (image↔text, image↔image+text, text↔image+text). Without constructing new triplet datasets and using only existing image-text pairs, GCL improves VISTA's average retrieval precision on M-BEIR from 21.18 to 34.06 (+60.8%), and on the text→image+text task of MMEB from 10.1% to 31.1%.
Hierarchical Retrieval: The Geometry and a Pretrain-Finetune Recipe: This paper investigates the feasibility of Dual Encoders (DE) for Hierarchical Retrieval (HR), theoretically proving that embedding dimensionality need only grow linearly with hierarchy depth and logarithmically with document count. After identifying the "lost-in-the-long-distance" phenomenon, the paper proposes a pretrain-finetune strategy that improves long-distance recall from 19% to 76% on WordNet.
HiFi-RAG: Hierarchical Content Filtering and Two-Pass Generation for Open-Domain RAG: By decoupling the filtering capability of a lightweight Flash model from the reasoning capability of a Pro model, the paper constructs a multi-stage pipeline (query optimization → hierarchical filtering → two-pass generation → citation verification) that achieves SOTA performance in the MMU-RAGent competition.
How Should We Evaluate Data Deletion in Graph-Based ANN Indexes?: To address the lack of a unified evaluation methodology for data deletion in graph-based ANN indexes, this paper formally defines three baseline approaches—lazy deletion, eager deletion, and reconstruction—proposes a deployment-oriented evaluation framework and metric suite, and introduces the Deletion Control algorithm, which dynamically switches deletion strategies under accuracy constraints based on empirical analysis.
HyperGraphRAG: Retrieval-Augmented Generation via Hypergraph-Structured Knowledge Representation: This paper proposes HyperGraphRAG, the first RAG method based on hypergraph structure, which models n-ary relations ($n \geq 2$) via hyperedges. It overcomes the binary-relation bottleneck of existing graph-based RAG methods, achieving comprehensive improvements over StandardRAG and the GraphRAG family on question-answering tasks across medical, agricultural, computer science, and legal domains.
Improving Consistency in Retrieval-Augmented Systems with Group Similarity Rewards: This paper proposes Con-RAG, a framework that trains RAG generators to produce informationally consistent outputs under paraphrased inputs by computing group similarity rewards across multiple generations of semantically equivalent queries via Paraphrased Set GRPO (PS-GRPO), simultaneously improving both consistency and accuracy without requiring explicit ground-truth supervision.
Is PRM Necessary? Problem-Solving RL Implicitly Induces PRM Capability in LLMs: This study systematically demonstrates that pure RL training (without explicit PRM supervision) implicitly induces strong process judgment capability; existing PRMs are even less effective than simple majority voting on strong reasoning models such as DeepSeek-R1 and QwQ-32B. The paper proposes Self-PRM, which allows a model to rerank its outputs using its own internal reward signal, consistently outperforming external PRMs.
Learning Task-Agnostic Representations through Multi-Teacher Distillation: This paper proposes a task-agnostic multi-teacher distillation framework based on mutual information maximization. By estimating the conditional distribution of teacher embeddings via Gaussian kernels, the student model learns high-information-density general-purpose representations without relying on any downstream task labels, achieving state-of-the-art performance among same-scale models across text, vision, and molecular modeling domains.
MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning?: This paper proposes MIR-Bench, the first large-scale and diverse many-shot in-context reasoning benchmark. By automatically generating input-output pairs from programming problems, it evaluates LLMs' pattern recognition capabilities, revealing a performance saturation phenomenon caused by attention diffusion in many-shot settings, and demonstrating that transductive reasoning consistently outperforms inductive reasoning across models.
MITRA: An AI Assistant for Knowledge Retrieval in Physics Collaborations: This paper proposes MITRA, a locally deployed RAG system for large physics experiment collaborations (e.g., CERN CMS), featuring a two-tier vector database architecture (abstract store + full-text store) and a fully on-premise deployment strategy. MITRA substantially outperforms traditional keyword-based search (BM25) on semantic retrieval tasks, improving Precision@1 from 0.13 to 0.75.
MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining: This paper proposes MuRating, a scalable multilingual data selection framework that aggregates multiple English data quality scorers via pairwise comparisons, transfers quality signals to 17 languages through translation, and trains a language-agnostic multilingual quality assessment model, achieving consistent performance gains in LLM pretraining at both 1.2B and 7B scales.
RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering: This paper introduces RAG-IGBench, a benchmark specifically designed to evaluate the quality of interleaved image-text content generated via retrieval-augmented generation. It proposes novel automatic evaluation metrics spanning three dimensions—text quality, image quality, and image-text consistency—and demonstrates strong correlation with human evaluation.
Reliable Decision Making via Calibration Oriented Retrieval Augmented Generation: This paper proposes CalibRAG, a framework that trains a temperature-conditioned forecasting function to ensure confidence calibration in RAG-assisted decision-making, achieving improvements in both calibration quality and accuracy.
Retrieval-Augmented Generation for Reliable Interpretation of Radio Regulations: This paper presents a dedicated RAG pipeline for radio regulations (a legally sensitive, high-stakes domain) and introduces the first ITU radio regulation multiple-choice evaluation benchmark. The proposed system achieves 97% retrieval accuracy and an 11.9% QA accuracy gain over GPT-4o, substantially outperforming naive full-document insertion into the prompt.
Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization: This paper proposes AlignRAG, a framework that reframes RAG as "retrieval-augmented reasoning" and trains a dedicated Critic Language Model (CLM) to iteratively critique and refine the reasoning process at test time, addressing the misalignment between reasoning chains and retrieved evidence. An 8B CLM surpasses a 72B standard CLM on out-of-distribution tasks.
RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition: This paper proposes the Routing-to-RAG (R2RAG) system, which employs an LLM-based query classifier to route simple queries to single-turn Vanilla RAG and complex queries to iterative Vanilla Agent. All components are built upon two lightweight models — Qwen3-4B (unquantized) and Qwen3-Reranker-0.6B — running on a single consumer-grade GPU, and the system won the Best Dynamic Evaluation award in the open-source track of the NeurIPS 2025 MMU-RAG competition.
Scaling Language-Centric Omnimodal Representation Learning: This paper proposes the LCO-Emb framework and demonstrates that Multimodal Large Language Models (MLLMs) implicitly establish cross-modal alignment during generative pretraining. Lightweight text-only contrastive fine-tuning suffices to activate full omnimodal representation capabilities. The work further identifies the Generation-Representation Scaling Law (GRSL), which establishes a positive correlation between generative capability and representation performance.
SeCon-RAG: A Two-Stage Semantic Filtering and Conflict-Free Framework for Trustworthy RAG: This paper proposes SeCon-RAG, a two-stage defense framework. The first stage employs clustering combined with semantic graph filtering to remove poisoned documents, while the second stage performs conflict-aware filtering at inference time. SeCon-RAG comprehensively outperforms existing RAG defense methods across 5 LLMs and 3 QA datasets, maintaining high accuracy and near-zero attack success rates even under 100% poisoning rates.
SuperCLIP: CLIP with Simple Classification Supervision: SuperCLIP augments the CLIP contrastive learning framework with an extremely simple classification loss — requiring only a lightweight linear layer that increases total FLOPs by merely 0.077% — to recover fine-grained textual supervision that CLIP underutilizes, achieving consistent improvements on zero-shot classification, image-text retrieval, and vision-only tasks.
SymRTLO: Enhancing RTL Code Optimization with LLMs and Neuron-Inspired Symbolic Reasoning: This paper proposes SymRTLO, the first neurosymbolic framework integrating LLMs with symbolic reasoning for RTL code optimization. By combining retrieval-augmented optimization rules, AST template-guided code generation, and an FSM symbolic system, SymRTLO achieves improvements of up to 43.9%, 62.5%, and 51.1% in power, performance, and area (PPA), respectively.
The Narrow Gate: Localized Image-Text Communication in Native Multimodal Models: Through systematic interpretability analysis, this work discovers that in native multimodal VLMs (Chameleon, Emu3), image-to-text cross-modal information transfer is concentrated at a single end-of-image [EOI] token—forming a "narrow gate" bottleneck. Ablating the [EOI] token's attention causes catastrophic performance collapse, whereas in non-native VLMs (LLaVA, etc.) the information transfer is distributed. This mechanistic difference can be exploited for semantic manipulation and robustness improvement.
The Transparent Earth: A Multimodal Foundation Model for the Earth's Subsurface: This paper proposes Transparent Earth, a Transformer-based multimodal foundation model that fuses 8 heterogeneous geophysical observation modalities via positional encoding and text-derived modality embeddings, enabling zero-shot inference and in-context learning for Earth subsurface property prediction.
Think Straight, Stop Smart: Structured Reasoning for Efficient Multi-Hop RAG: This paper proposes the TSSS (Think Straight, Stop Smart) framework, which achieves state-of-the-art accuracy and competitive efficiency on multi-hop RAG benchmarks through (i) template-based reasoning that caches repeated prefixes and anchors sub-queries to the main question, and (ii) a retriever-based deterministic terminator that halts reasoning upon sub-query repetition.
Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation: This paper proposes a dual-component framework (Windsock + DANCE) to address three core challenges in multimodal RAG: the Windsock module adaptively determines when to retrieve and which modality to retrieve (text/image/none) based on the query; the DANCE instruction fine-tuning strategy improves how to utilize retrieved information by dynamically selecting the model's weakest modality for noise-robust training. The overall framework achieves a 17.07% performance improvement while reducing retrieval calls by 8.95%.
Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals: This paper introduces RAGuard, the first benchmark dataset to systematically evaluate the robustness of RAG systems against misleading retrieved content. By constructing a realistic retrieval corpus from Reddit — containing supporting, misleading, and unrelated documents — it demonstrates that all tested LLM-RAG systems perform worse than a zero-shot baseline when exposed to misleading retrievals, whereas human annotators maintain consistent judgment.
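
The TSSS entry above describes a deterministic terminator that halts multi-hop reasoning once a sub-query repeats. The sketch below illustrates only that stopping rule. It substitutes normalized string equality for the paper's retriever-based check, and `next_sub_query`, `normalize`, and `max_hops` are hypothetical names introduced here, not taken from the paper.

```python
from typing import Callable, List


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(query.lower().split())


def run_until_repeat(
    next_sub_query: Callable[[List[str]], str],
    max_hops: int = 8,
) -> List[str]:
    """Collect sub-queries until one repeats or the hop budget runs out.

    `next_sub_query` is a hypothetical stand-in for the reasoning model: it
    receives the sub-queries issued so far and returns the next one.
    """
    issued: List[str] = []
    seen = set()
    for _ in range(max_hops):
        candidate = next_sub_query(issued)
        key = normalize(candidate)
        if key in seen:
            # Repetition detected: stop deterministically (assumed rule;
            # the paper uses a retriever-based check instead).
            break
        seen.add(key)
        issued.append(candidate)
    return issued


# Toy generator that starts repeating itself on the third hop.
script = [
    "Who directed Inception?",
    "Where was he born?",
    "who directed  inception?",
]
hops = run_until_repeat(lambda issued: script[min(len(issued), len(script) - 1)])
print(hops)
```

A retriever-based variant would presumably compare what each sub-query retrieves rather than the query strings themselves; the control flow would stay the same.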

🖼️ Image Restoration

Adaptive Discretization for Consistency Models: This paper proposes ADCM, which formalizes the discretization step size of consistency models as a constrained optimization problem balancing local consistency (trainability) and global consistency (stability), derives a closed-form solution via the Gauss-Newton method, and achieves adaptive discretization that surpasses all prior CMs on CIFAR-10 using less than 25% of the training budget.
Audio Super-Resolution with Latent Bridge Models: This paper proposes AudioLBM, which compresses audio waveforms into a continuous latent space and employs a bridge model to realize a latent-to-latent generation process from low-resolution to high-resolution. Combined with frequency-aware training for broader data utilization and a cascaded design to surpass the 48kHz ceiling, AudioLBM comprehensively outperforms methods such as AudioSR across speech, sound effects, and music, while achieving any-to-192kHz audio super-resolution for the first time.
DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration: This paper proposes DenoiseRotator, a pre-pruning method that applies learnable orthogonal transformations to minimize the information entropy of parameter importance scores, concentrating importance into a small subset of parameters. On LLaMA3-70B under 2:4 semi-structured sparsity, perplexity degradation is reduced by 58% (8.1→3.4). The method is plug-and-play and compatible with Magnitude, Wanda, and SparseGPT.
DynaGuide: Steering Diffusion Policies with Active Dynamic Guidance: This paper proposes DynaGuide, which applies classifier guidance to a frozen pretrained diffusion policy at inference time via an external latent dynamics model, steering the robot toward arbitrary positive/negative goals without modifying policy weights. It achieves an average success rate of 70% on CALVIN simulation and 80% on a real robot.
Enhancing Infrared Vision: Progressive Prompt Fusion Network and Benchmark: To address the challenge of coupled degradations (low contrast, blur, and noise) in thermal infrared (TIR) images, this paper proposes PPFN, a progressive prompt fusion network with a dual-prompt design, along with the Selective Progressive Training (SPT) strategy. The authors also construct HM-TIR, the first large-scale multi-scene TIR benchmark dataset. The proposed method achieves an 8.76% PSNR improvement in composite degradation scenarios.
FIPER: Factorized Features for Robust Image Super-Resolution and Compression: This paper proposes a Factorized Features representation that decomposes images into learnable non-uniform bases and spatially variant coefficients, augmented with sawtooth coordinate transformation and multi-frequency modulation. The approach achieves a 204.4% relative PSNR gain at 4× super-resolution (HAT-L-F vs. SwinIR) and a 21.09% BD-rate reduction over VTM in image compression.
GC4NC: A Benchmark Framework for Graph Condensation on Node Classification with New Insights: This paper proposes GC4NC—the first systematic benchmark framework for graph condensation (GC)—which evaluates multiple GC methods across 8 dimensions (performance / efficiency / privacy protection / denoising / NAS effectiveness / transferability, etc.), finding that trajectory matching methods achieve the best performance, structure-free methods are most efficient, and graph condensation significantly outperforms image condensation under 1000× compression.
Implicit Augmentation from Distributional Symmetry in Turbulence Super-Resolution: This paper demonstrates that the statistical isotropy of turbulence itself constitutes a form of implicit data augmentation, enabling standard CNNs to partially learn rotational equivariance in super-resolution tasks without explicit rotation augmentation or equivariant architectures. The authors further show that the scale dependence of equivariance error is consistent with Kolmogorov's local isotropy hypothesis.
Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Learnable Linear Extrapolation: This paper proposes Learnable Linear Extrapolation (LLE), which combines current and historical clean data estimates via learnable linear coefficients to enhance any diffusion inverse problem algorithm conforming to the Sampler-Corrector-Noiser paradigm under few-step (3–5 steps) constraints. The method requires only 50 training samples and a few minutes of training, yielding consistent improvements across 9+ algorithms × 5 tasks.
Latent Harmony: Synergistic Unified UHD Image Restoration via Latent Space Regularization and Controllable Refinement: This paper proposes Latent Harmony, a two-stage framework that constructs a generalizable VAE (LH-VAE) via latent space regularization, and introduces a high-frequency-guided controllable LoRA fine-tuning mechanism, achieving flexible fidelity-perceptual quality trade-offs in unified multi-degradation UHD image restoration while preserving structural integrity.
Latent Harmony: Synergistic Unified UHD Image Restoration via Latent Space Regularization and Controllable Refinement: This paper proposes Latent Harmony, a two-stage framework that constructs a degradation-robust LH-VAE via latent space regularization, and subsequently applies high-frequency-guided LoRA fine-tuning to independently optimize the encoder (fidelity) and decoder (perceptual quality), achieving a unified solution to the generalization–reconstruction–perception trilemma in all-in-one UHD image restoration.
Learning Cocoercive Conservative Denoisers via Helmholtz Decomposition for Poisson Inverse Problems: This paper introduces the concept of Cocoercive Conservative (CoCo) denoisers and proposes a novel training strategy via generalized Helmholtz decomposition — Hamiltonian regularization to promote conservativeness and spectral regularization to promote cocoerciveness — enabling denoisers to serve as proximal operators of implicit weakly convex priors, thereby achieving convergence-guaranteed and high-performance PnP methods for Poisson inverse problems (photon-limited deconvolution, low-dose CT, etc.).
Luminance-Aware Statistical Quantization: Unsupervised Hierarchical Learning for Illumination Enhancement: This paper proposes the LASQ framework, which reformulates low-light image enhancement (LLIE) as a statistical sampling process over hierarchical luminance distributions. By exploiting the power-law distribution inherent in natural luminance transitions, LASQ employs MCMC sampling to generate hierarchical luminance adaptation operators (LAOs) that are embedded into the forward process of a diffusion model, enabling fully unsupervised enhancement without requiring any normal-light reference images.
MAP Estimation with Denoisers: Convergence Rates and Guarantees: This paper proves that a simple iterative averaging algorithm based on MMSE denoisers—closely related to practical methods such as Cold Diffusion—provably converges to the proximal operator of the negative log-prior under log-concave prior assumptions, achieving a convergence rate of $\tilde{O}(1/k)$. The work provides rigorous theoretical foundations for a class of denoising methods that have demonstrated empirical success but lacked theoretical guarantees, and embeds the approach within a proximal gradient descent framework for MAP estimation.
MoDEM: A Morton-Order Degradation Estimation Mechanism for Adverse Weather Image Restoration: This paper proposes the MODEM framework, which combines Morton-encoded spatial scanning with selective state space models (SSMs) to capture spatially heterogeneous weather degradation patterns. Equipped with a dual degradation estimation module that provides both global and local priors, MODEM achieves state-of-the-art unified adaptive restoration across multiple adverse weather degradation types.
MoE-Gyro: Self-Supervised Over-Range Reconstruction and Denoising for MEMS Gyroscopes: This paper proposes MoE-Gyro, a self-supervised Mixture-of-Experts framework that simultaneously addresses the fundamental range–noise trade-off in MEMS gyroscopes via an Over-Range Reconstruction Expert (ORE, incorporating Gaussian-Decay Attention and physics-informed constraints) and a Denoising Expert (DE, incorporating dual-branch complementary masking and FFT-guided augmentation). The measurable range is extended from ±450°/s to ±1500°/s, and bias instability is reduced by 98.4%.
MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization: This work presents the first systematic analysis of the root cause behind the reasoning gap in diffusion language models (DLMs): tokens are generated independently during denoising, which disrupts both intra- and inter-sequence correlations. A multi-reward optimization framework, MRO, is proposed and consistently improves the reasoning performance of LLaDA-8B across test-time scaling, rejection sampling, and RL paradigms, raising MATH500 accuracy from 34.4% to 37.4%.
MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation: This paper proposes MS-BART, which maps molecular fingerprints and molecular structures (SELFIES) into a shared token space via a unified vocabulary, performs multi-task pretraining on 4 million fingerprint–molecule pairs, and subsequently applies experimental spectra fine-tuning and chemical feedback alignment to enable efficient generation of molecular structures from mass spectra.
Real-World Adverse Weather Image Restoration via Dual-Level Reinforcement Learning with High-Quality Cold Start: This paper proposes a Dual-level Reinforcement Learning (DRL) framework that uses a physics-driven, million-scale synthetic weather dataset, HFLS-Weather, for high-quality cold-start training. It then achieves adaptive real-world adverse weather image restoration through Perturbation-driven Image Quality Optimization (PIQO) at the local level and meta-controller-driven multi-agent collaboration at the global level.
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates: This paper systematically introduces AND, OR, and ADDER gates to decompose language model circuits, reveals that circuit incompleteness primarily stems from the omission of OR gates, and proposes a framework combining noising and denoising interventions to fully recover all three gate types while guaranteeing both faithfulness and completeness.
Rethinking Nighttime Image Deraining via Learnable Color Space Transformation: Motivated by the statistical finding that nighttime rain exhibits far greater contrast in the Y channel (luminance) of YCbCr than in RGB, this work proposes a learnable Color Space Converter (CSC) that performs deraining in the Y channel, an Implicit Illumination Guidance (IIG) module that encodes non-uniform nighttime illumination, and a photorealistic dataset HQ-NightRain constructed via illumination-aware synthesis. The three components jointly yield substantial improvements in nighttime deraining performance.
SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning: This paper proposes the SCAN framework, which analyzes the noise distribution in Monte Carlo annotations to design a self-denoising sampling strategy and a robust learning loss. A PRM trained on only 101K samples generated by a 1.5B model outperforms one trained on the human-annotated PRM800K dataset.
scSplit: Bringing Severity Cognizance to Image Decomposition in Fluorescence Microscopy: This paper proposes scSplit, which introduces a severity-cognizant input normalization module (SCIN) and a regression network (Reg) to endow an InDI-based iterative image decomposition framework with awareness of the mixing severity of two overlapping structures in fluorescence microscopy images. The method unifies image splitting and bleedthrough removal across five public datasets under a single framework.
Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Bootstrapping: This paper provides the first theoretical analysis of the budget allocation problem in iterative synthetic data bootstrapping, proving that constant strategies fail to converge with high probability, that exponential growth strategies outperform polynomial strategies in the worst case, and validating these findings empirically on image denoising (DPM) and mathematical reasoning (LLM) tasks.
Spiking Meets Attention: Efficient Remote Sensing Image Super-Resolution with Attention Spiking Neural Networks: This paper proposes SpikeSR, the first attention-based spiking neural network (SNN) framework for remote sensing image super-resolution. By incorporating Spiking Attention Blocks (SAB) that combine Hybrid Dimensional Attention (HDA) and Deformable Similarity Attention (DSA), SpikeSR achieves state-of-the-art performance on AID/DOTA/DIOR while maintaining high computational efficiency.
The Effect of Optimal Self-Distillation in Noisy Gaussian Mixture Model: This paper presents a rigorous theoretical analysis of hyperparameter-optimized multi-stage self-distillation on noisy Gaussian mixture data using the replica method from statistical physics. It shows that the denoising effect of hard pseudo-labels is the primary driver of performance gains in self-distillation and that moderate-sized datasets benefit the most. It also proposes two practical improvement strategies: early stopping (limiting the number of distillation stages) and bias parameter fixing. Theoretical predictions are validated through experiments on CIFAR-10 with ResNet.
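
The MoDEM entry above relies on Morton-encoded (Z-order) spatial scanning. As a rough illustration of the general technique, not the paper's implementation, the sketch below interleaves the bits of pixel coordinates and sorts a grid along the resulting curve. The function names and the choice of placing the column index in the even bits are assumptions made for this example.

```python
from typing import List, Tuple


def morton_encode(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y (x in even positions, y in odd ones)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def morton_scan_order(height: int, width: int) -> List[Tuple[int, int]]:
    """Return (row, col) positions of a grid sorted along the Z-order curve."""
    coords = [(r, c) for r in range(height) for c in range(width)]
    # Column plays the role of x, row the role of y (an arbitrary convention).
    return sorted(coords, key=lambda rc: morton_encode(rc[1], rc[0]))


order = morton_scan_order(4, 4)
print(order[:8])
# → [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (0, 3), (1, 2), (1, 3)]
```

Compared with a row-by-row raster scan, this ordering keeps each 2×2 block (and recursively each 4×4 block) contiguous in the 1D sequence, which is the usual reason for feeding Z-ordered tokens to a sequence model.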

🎁 Recommender Systems

ASAP: An Agentic Solution to Auto-Optimize Performance of Large-Scale LLM Training: ASAP is a multi-agent system (Coordinator + Analyzer + Proposal) that automatically diagnoses bottleneck types (compute/memory/communication) in large-scale LLM distributed training and proposes sharding configurations. Across 3 experimental scenarios, it matches human expert solutions and achieves up to 2.58× throughput improvement.
Balancing Performance and Costs in Best Arm Identification: This paper reformulates Best Arm Identification (BAI), moving from the fixed-budget/fixed-confidence paradigm to a risk functional minimization problem over misidentification probability (or simple regret) plus sampling cost. It derives lower bounds exhibiting a phase transition (when the gap is too small, the optimal strategy is to guess directly), and designs the DBCARE algorithm, which achieves optimality within logarithmic factors under a dynamic budget.
EMPATHIA: Multi-Faceted Human-AI Collaboration for Refugee Integration: This paper proposes EMPATHIA, a multi-agent framework grounded in Kegan's constructive-developmental theory. Three specialized agents—emotional, cultural, and ethical—engage in selector-validator negotiation to evaluate refugee resettlement recommendations. On real-world data from 6,359 refugees, the framework achieves an 87.4% convergence rate and 92.1% cultural expert agreement rate.
Estimating Hitting Times Locally At Scale: Two local (sublinear) algorithms are proposed for estimating hitting times on graphs — Algorithm 1 based on meeting times and Algorithm 3 based on spectral truncation. Both require only short random walks centered at $u$ and $v$ without full graph access, achieving relative error <1.4% on synthetic and real-world graphs. An optimal sample complexity lower bound for walk-based estimation is also established.
FACE: A General Framework for Mapping Collaborative Filtering Embeddings into LLM Tokens: FACE proposes mapping collaborative filtering (CF) embeddings into LLM pre-trained tokens (descriptors) via disentangled projection and residual quantization, followed by contrastive learning for semantic alignment — enabling semantic interpretation and recommendation enhancement of CF embeddings without fine-tuning the LLM.
Inference-Time Reward Hacking in Large Language Models: This paper mathematically proves that inference-time alignment methods (e.g., BoN) inevitably exhibit reward hacking (true reward first increases then decreases) when optimizing a proxy reward. It proposes Best-of-Poisson (BoP) sampling to approximate the optimal KL-reward trade-off distribution, and designs the HedgeTune algorithm to locate the optimal inference-time parameter via one-dimensional root-finding, effectively mitigating reward hacking in both mathematical reasoning and human preference settings.
Measuring What Matters: Construct Validity in Large Language Model Benchmarks: This paper presents a systematic review of 445 LLM benchmark papers conducted by 29 experts, examining existing LLM evaluation benchmarks through the lens of construct validity across four dimensions — phenomenon definition, task design, scoring metrics, and conclusion claims — and proposes 8 actionable recommendations for improvement.
MMPB: It's Time for Multi-Modal Personalization: This paper introduces MMPB, the first VLM personalization evaluation benchmark, comprising 111 personalizable concepts, 10k+ image-text QA pairs, and 15 task types. Evaluation of 23 VLMs reveals that even the strongest model, GPT-4o, performs poorly on personalization tasks, exposing critical limitations in preference reasoning, visual cue utilization, and conflicts between safety alignment and personalization.
NeurIPS Should Lead Scientific Consensus on AI Policy: This position paper argues that NeurIPS should proactively assume the role of facilitating scientific consensus in AI policy, drawing on the successful experience of the IPCC (Intergovernmental Panel on Climate Change) in climate science to fill the current gap in AI policy consensus mechanisms.
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning: This work identifies two categories of sparsity artifacts introduced by L1 loss in Crosscoders—Complete Shrinkage (which erroneously zeros out weakly shared concepts) and Latent Decoupling (which splits shared concepts into spurious model-specific latents)—and proposes Latent Scaling as a diagnostic tool and BatchTopK Crosscoder as an alternative training scheme, substantially improving the reliability of chat-tuning concept discovery.
PAC-Bayes Bounds for Multivariate Linear Regression and Linear Autoencoders: This paper extends PAC-Bayes generalization bounds from single-output linear regression to multivariate linear regression, and further adapts them to linear autoencoders (LAEs) in recommender systems. Through theoretical development, the computational complexity is reduced from $O(n^4)$ to $O(n^3)$, and experiments demonstrate that the bounds are tight and highly correlated with practical metrics such as Recall@K and NDCG@K.
Position: Towards Bidirectional Human-AI Alignment: This paper proposes a Bidirectional Human-AI Alignment framework grounded in a systematic review of 400+ papers, arguing that AI alignment should not be limited to the unidirectional goal of "aligning AI with humans," but must also encompass the critically underexplored direction of "aligning humans with AI," while identifying key gaps in the current research landscape.
R²ec: Towards Large Recommender Models with Reasoning: This paper proposes R²ec, the first unified large recommender model that endogenously integrates reasoning capabilities, achieving joint reasoning chain generation and efficient item prediction via a dual-head architecture, and introduces the RecPO reinforcement learning framework to jointly optimize reasoning and recommendation objectives without any annotated reasoning data.
Radial Neighborhood Smoothing Recommender System: This paper proposes the Radial Neighborhood Estimator (RNE), which approximates latent space distances using the row/column L2 norms of the observed matrix, constructs radial neighborhoods encompassing both overlapping and partially overlapping user–item pairs, and applies local kernel regression for smoothed imputation. RNE outperforms conventional collaborative filtering and matrix factorization methods in both theoretical guarantees and empirical evaluations, while naturally alleviating the cold-start problem.
Semantic Retrieval Augmented Contrastive Learning for Sequential Recommendation: This paper proposes SRA-CL, a framework that leverages the semantic understanding capabilities of LLMs to construct high-quality contrastive sample pairs. By combining semantic retrieval with a learnable sample synthesizer, SRA-CL enhances contrastive learning for sequential recommendation and achieves state-of-the-art performance across four datasets in a plug-and-play manner.
The Coming Crisis of Multi-Agent Misalignment: AI Alignment Must Be a Dynamic and Social Process: A position paper arguing that AI alignment in multi-agent systems (MAS) should be treated as a dynamic, interaction-dependent social process rather than an isolated problem. Drawing on social science theories, the paper analyzes how social structures can undermine collective and individual values, and calls on the AI community to develop dedicated simulation environments, benchmarks, and evaluation frameworks to address this challenge.
The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems: This paper systematically identifies four methodological pitfalls in current AI scientist systems—inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias—through controlled experiments on Agent Laboratory and The AI Scientist v2 using a carefully designed synthetic task (SPR). Both systems exhibit these issues to varying degrees. The paper further demonstrates that auditing trace logs and code achieves 27 percentage points higher detection accuracy than reviewing final papers alone (82% vs. 55%).
Think before Recommendation: Autonomous Reasoning-enhanced Recommender: This paper proposes RecZero (a pure RL paradigm) and RecOne (a hybrid SFT+RL paradigm), abandoning conventional teacher-student distillation. Both approaches leverage GRPO-based reinforcement learning to train a single LLM to autonomously develop reasoning capabilities for rating prediction. A structured "Think-before-Recommendation" template guides step-by-step reasoning (user analysis → item analysis → matching → rating), achieving significant improvements over existing baselines across four datasets.
Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning: This paper proposes the Transformer Copilot framework, which systematically records a "Mistake Log" during LLM fine-tuning, trains an auxiliary Copilot model to learn the Pilot's error patterns, and rectifies logits at inference time to improve generation quality, achieving up to 34.5% improvement across 12 benchmarks.
TV-Rec: Time-Variant Convolutional Filter for Sequential Recommendation: This paper proposes TV-Rec, a time-variant convolutional filter grounded in graph signal processing that replaces conventional fixed convolutions and self-attention mechanisms, achieving higher expressiveness for sequential recommendation with an average improvement of 7.49% across 6 benchmark datasets.
Validating LLM-as-a-Judge Systems under Rating Indeterminacy: This paper proposes a framework for validating LLM-as-a-Judge systems under rating indeterminacy. It replaces forced-choice rating with a "response set" multi-label rating scheme, improving the performance of the selected judge system by up to 31%.
VisualLens: Personalization through Task-Agnostic Visual History: This paper proposes VisualLens, a framework that leverages users' task-agnostic visual history (everyday photos) to enable cross-domain personalized recommendation via spectrum user profiles and multimodal large language models. On the newly constructed Google Review-V and Yelp-V datasets, VisualLens surpasses GPT-4o by 2–5% in Hit@3.
Who You Are Matters: Bridging Topics and Social Roles via LLM-Enhanced Logical Recommendation: This paper proposes TagCF, a framework that employs an MLLM to extract user role tags and item topic tags, then uses LLM reasoning to construct U2I/I2U logic graphs (causal associations between user roles and item types). Three integration strategies (a tag encoder, contrastive learning augmentation, and logic-based scoring) are used to enhance recommendations. On an industrial platform with hundreds of millions of users, online A/B testing yields a 0.946% improvement in engagement metrics and a 0.102% gain in diversity; offline experiments show an 8.06% improvement in NDCG@10.
Wide-Horizon Thinking and Simulation-Based Evaluation for Real-World LLM Planning with Multifaceted Constraints: This paper proposes MAoP (Multiple Aspects of Planning), a framework that endows LLMs with "wide-horizon thinking" by having a strategist perform multi-aspect pre-planning and routing into a coherent blueprint, enabling the planner to conduct in-depth per-aspect analysis in parallel. Coupled with the Travel-Sim causal simulation benchmark, MAoP substantially outperforms CoT and decomposition-based methods on travel planning tasks; a distilled 3B model achieves a PER of 66.9%.
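
Several entries above report ranking metrics such as Hit@3, Recall@K, and NDCG@10. For readers unfamiliar with them, the sketch below computes Hit@K and NDCG@K under the common binary-relevance definition. Individual papers may use variants (for example, graded relevance or a single held-out target), so this is an illustration rather than any paper's evaluation code; the function names are chosen for this example.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant item appears in the top-k positions, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@K: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gives log2(2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked_items = ["a", "b", "c", "d"]
relevant_items = {"b", "d"}
print(hit_at_k(ranked_items, relevant_items, k=3))
print(round(ndcg_at_k(ranked_items, relevant_items, k=3), 4))
```

In this toy case only "b" lands in the top 3 (at position 2), so Hit@3 is 1.0 while NDCG@3 is well below 1: the metric penalizes both the missing item "d" and the fact that "b" is not ranked first.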

🧮 Scientific Computing

A Regularized Newton Method for Nonconvex Optimization with Global and Local Complexity Guarantees: This paper proposes a novel class of regularizers constructed from current and historical gradients, combined with a conjugate gradient method equipped with negative-curvature detection to solve the regularized Newton equation. Within an adaptive framework that requires no prior knowledge of the Hessian Lipschitz constant, the method simultaneously achieves, for the first time, the optimal global iteration complexity of $O(\epsilon^{-3/2})$ and a quadratic local convergence rate.
Bayesian Surrogates for Risk-Aware Pre-Assessment of Aging Bridge Portfolios: A Bayesian neural network (BNN)-based surrogate model is proposed to replace expensive nonlinear finite element analysis (NLFEA), enabling rapid, uncertainty-aware structural safety pre-assessment of aging bridge portfolios. In a real-world railway case study, the approach saves approximately $370,000 per bridge.
Collapsing Taylor Mode Automatic Differentiation: This paper proposes a collapsing optimization technique for Taylor mode automatic differentiation. By rewriting the computation graph to propagate derivative summation operations upward, it substantially accelerates the evaluation of PDE operators (e.g., Laplacian, general linear PDE operators), achieving speeds superior to nested backpropagation while retaining the low-memory advantage of forward-mode AD.
DeltaPhi: Physical States Residual Learning for Neural Operators in Data-Limited PDE Solving: This paper proposes DeltaPhi, a framework that forgoes direct learning of the input-to-output mapping for PDEs and instead learns residuals between similar physical states. By exploiting the stability of physical systems as implicit data augmentation, DeltaPhi significantly improves the performance of diverse neural operators under data-scarce regimes.
EddyFormer: Accelerated Neural Simulations of Three-Dimensional Turbulence at Scale: EddyFormer is a Transformer architecture based on the Spectral Element Method (SEM) that decomposes the flow field into two parallel streams — LES (large-scale) and SGS (small-scale) — achieving DNS-level accuracy on 3D turbulence at $256^3$ resolution with a 30× speedup, while generalizing well to unseen domains 4× larger.
Enforcing Governing Equation Constraints in Neural PDE Solvers via Training-free Projections: Two training-free post-processing projection methods are proposed—nonlinear LBFGS optimization and local linearization projection—to project the outputs of neural PDE solvers onto the feasible manifold satisfying governing equation constraints. Evaluated on Lorenz/KS/Navier-Stokes, both methods substantially reduce constraint violations and improve accuracy, markedly outperforming physics-informed training.
F-Adapter: Frequency-Adaptive Parameter-Efficient Fine-Tuning in Scientific Machine Learning: This paper presents the first systematic study of parameter-efficient fine-tuning (PEFT) for pretrained large operator models (LOMs) in scientific machine learning. It demonstrates that LoRA exhibits a depth-amplified approximation error lower bound in Fourier layers, whereas Adapter preserves universal approximation capacity. Building on this analysis, the paper proposes the Frequency-Adaptive Adapter (F-Adapter), which allocates adapter capacity according to spectral energy distribution. On 3D Navier-Stokes prediction tasks, F-Adapter achieves state-of-the-art performance while tuning fewer than 2% of parameters.
From Black Hole to Galaxy: Neural Operator Framework for Accretion and Feedback Dynamics: A Neural Operator-based "sub-grid black hole" model is proposed to learn the small-scale (GR)MHD time-evolution operator $u_t \to u_{t+\Delta T}$, replacing hand-crafted closure rules embedded in a multi-level direct numerical simulation framework. This work captures, for the first time, the intrinsic variability of accretion-driven feedback, with a speedup of $\sim 10^5\times$.
From Images to Physics: Probabilistic Inference of Galaxy Parameters and Emission Lines via VAE & Normalizing Flows: This work proposes a VAE–Normalizing Flow hybrid framework that jointly infers galaxy physical parameters (stellar mass, SFR, redshift, gas-phase metallicity, central black hole mass) and emission line fluxes (Hα, Hβ, [N II], [O III]) in a probabilistic manner from SDSS gri images and photometric data, achieving over 100× speedup relative to SED fitting while providing well-calibrated posterior distributions.
GyroSwin: 5D Surrogates for Gyrokinetic Plasma Turbulence Simulations: This work presents GyroSwin, the first scalable 5D neural surrogate model for gyrokinetic plasma turbulence. It extends the Swin Transformer to the 5D gyrokinetic phase space, employs cross-attention for 3D↔5D interaction, and adopts channelwise mode separation to capture zonal flows. GyroSwin achieves higher accuracy than conventional quasilinear methods while being three orders of magnitude faster than the numerical solver GKW.
Hamiltonian Neural PDE Solvers through Functional Approximation: Grounded in the Riesz representation theorem, this work approximates infinite-dimensional Hamiltonian functionals via learnable integral kernel functionals (IKF). Functional derivatives are obtained through automatic differentiation, yielding an energy-conserving neural PDE solver (HNS) that demonstrates superior stability and generalization on 1D/2D PDEs.
INC: An Indirect Neural Corrector for Auto-Regressive Hybrid PDE Solvers: This paper proposes the Indirect Neural Corrector (INC), which embeds learned correction terms into the right-hand side (RHS) of PDEs rather than directly modifying the state. The approach is theoretically shown to reduce error amplification by a factor of $\mathcal{O}(\Delta t^{-1}+L)$, and achieves substantial improvements in long-term trajectory performance across 6 PDE systems (from 1D chaos to 3D turbulence), with R² gains up to 158.7% and up to 330× acceleration.
Integration Matters for Learning PDEs with Backward SDEs: This paper identifies why standard BSDE methods underperform PINNs, namely an irreducible discretization bias introduced by Euler-Maruyama integration, and proposes Heun-BSDE, based on the Stratonovich formulation, to fully eliminate this bias. The resulting method is competitive with PINNs on high-dimensional PDEs.
Multi-Trajectory Physics-Informed Neural Networks for HJB Equations with Hard-Zero Terminal Inventory: Optimal Execution on Synthetic & SPY Data: To address the hard-zero terminal inventory constraint ($X_T=0$) in HJB equations arising from optimal trade execution, this paper proposes Multi-Trajectory PINN (MT-PINN). Through a rollout-based terminal loss and a $\lambda$-curriculum training strategy, MT-PINN significantly outperforms vanilla PINN on both synthetic benchmarks and backtests on SPY data, achieving a substantial reduction in terminal inventory violation rates.
Neural Emulator Superiority: When Machine Learning for PDEs Surpasses its Training Data: This work challenges the prevailing assumption that the accuracy of neural PDE emulators is bounded by that of their training data (i.e., the numerical solver). It discovers and rigorously defines the phenomenon of emulator superiority—neural networks trained solely on low-accuracy solver data can, when evaluated against high-accuracy reference solutions, outperform the very solver that generated their training data.
Neuro-Spectral Architectures for Causal Physics-Informed Networks: NeuSA integrates classical spectral methods with Neural ODEs: the PDE is projected onto a spectral basis (Fourier) to obtain an ODE system, which is then solved by a NODE that learns the dynamical evolution. This architecture-level design eliminates the spectral bias and causality violations inherent in conventional PINNs, achieving errors 1–2 orders of magnitude lower than baselines on wave, Burgers, and sine-Gordon equations while training faster.
From Images to Physics: Probabilistic Inference of Galaxy Parameters and Emission Lines via VAE–Normalizing Flows: A two-stage VAE–Normalizing Flow probabilistic inference framework is proposed that infers stellar mass, SFR, redshift, black hole mass, metallicity, and emission line fluxes directly from SDSS galaxy images and photometric data, surpassing existing non-spectroscopic methods in accuracy while being over 100× faster than SED fitting.
One-Shot Transfer Learning for Nonlinear PDEs with Perturbative PINNs: By combining perturbation theory with PINNs, this work decomposes nonlinear PDEs into a sequence of linear subproblems. After learning the latent space of the linear operator via a Multi-Head PINN, transfer to new PDE instances is achieved through a closed-form solution within 0.2 seconds, attaining errors on the order of $10^{-3}$.
Physics-Guided Machine Learning for Uncertainty Quantification in Turbulence Models: This paper proposes a hybrid ML–EPM framework that employs a lightweight CNN to learn a correction mapping from RANS turbulent kinetic energy fields to DNS ground truth, using the learned corrections to modulate the perturbation magnitude of the Eigenspace Perturbation Method (EPM). The approach reduces uncertainty estimation errors by 1–2 orders of magnitude while preserving physical consistency.
Physics-Informed Neural Networks with Fourier Features and Attention-Driven Decoding: This paper proposes Spectral PINNsformer (S-Pformer), which replaces the encoder of PINNsformer with Fourier feature embeddings and adopts a decoder-only Transformer architecture. S-Pformer achieves superior performance on multiple PDE benchmarks while reducing parameter count by 18.6%, effectively alleviating the spectral bias problem.
Stable Minima of ReLU Neural Networks Suffer from the Curse of Dimensionality: The Neural Shattering Phenomenon: This paper investigates the generalization properties of stable minima (flat minima) in two-layer overparameterized ReLU networks. It proves that while flatness does imply generalization, the convergence rate deteriorates exponentially with input dimension (i.e., the curse of dimensionality applies), forming an exponential separation from low-norm solutions (weight decay) that are immune to this curse. The paper further identifies the "neural shattering" phenomenon as the geometric mechanism underlying failure in high dimensions.
Symbolic Regression Is All You Need: From Simulations to Scaling Laws in Binary Neutron Star Mergers: This work applies Symbolic Regression (SR) to automatically discover analytic calibration relations for post-merger accretion disk mass in binary neutron star mergers from numerical relativity simulation data. The resulting compact expressions comprehensively outperform existing empirical fitting formulae in the literature in terms of predictive accuracy, generalization, and interpretability.
The Primacy of Magnitude in Low-Rank Adaptation: This paper reveals that weight update magnitude is the fundamental driver of performance in LoRA, unifying the influence of learning rate, scaling factor, and initialization strategy under a single framework. It further proposes LoRAM—an efficient initialization method based on deterministic orthogonal bases and magnitude scaling—that matches or surpasses spectral initialization methods without requiring SVD.
Towards Universal Neural Operators through Multiphysics Pretraining: This paper proposes an adapter-based multiphysics pretraining framework for neural operators. By treating lifting/projection layers as problem-specific adapters and freezing shared kernel integration operator layers, the framework enables transfer learning across PDE problems, substantially reducing fine-tuning cost while improving generalization.
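
As a reading aid for the "local linearization projection" mentioned in the Enforcing Governing Equation Constraints entry above, here is a minimal sketch of the general idea on a toy problem. It is not the paper's implementation: the scalar algebraic constraint (a unit circle), the function names, and the tolerances are all assumptions chosen for illustration, whereas the paper applies projections to neural PDE solver outputs.

```python
def project_onto_constraint(x, constraint, gradient, tol=1e-10, max_iter=20):
    """Repeatedly apply the minimum-norm correction of the linearized constraint.

    For a scalar constraint c(x) = 0, linearizing around x gives the update
    x <- x - c(x) * g / ||g||^2, where g is the gradient of c at x.
    """
    x = list(x)
    for _ in range(max_iter):
        c = constraint(x)
        if abs(c) < tol:
            break
        g = gradient(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq == 0.0:
            break  # degenerate linearization, nothing to project along
        step = c / g_norm_sq
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical constraint standing in for a governing equation: x^2 + y^2 = 1.
def circle_constraint(x):
    return x[0] ** 2 + x[1] ** 2 - 1.0


def circle_gradient(x):
    return [2.0 * x[0], 2.0 * x[1]]


raw_prediction = [1.5, 0.5]  # stand-in for an unconstrained model output
projected = project_onto_constraint(raw_prediction, circle_constraint, circle_gradient)
print(projected, circle_constraint(projected))
```

Because the correction is applied after prediction, nothing about the underlying model needs to be retrained, which is the sense in which such projections are training-free.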

🎬 Video Generation¶

Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation: This paper proposes AAPT (Autoregressive Adversarial Post-Training), which converts a pretrained video diffusion model into an autoregressive real-time video generator via adversarial training. The model requires only one forward pass per frame (1NFE), employs student-forcing training to reduce error accumulation, and achieves real-time streaming generation at 736×416 resolution and 24fps on a single H100 GPU, supporting videos up to one minute in length (1440 frames).
DisMo: Disentangled Motion Representations for Open-World Motion Transfer: DisMo learns abstract motion representations that are agnostic to appearance, pose, and category from raw videos via a dual-stream architecture (motion extractor + frame generator) and an image-space reconstruction objective. It enables open-world motion transfer across categories and viewpoints, and significantly outperforms video representation models such as V-JEPA on zero-shot action classification.
Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals: This paper proposes Force Prompting, which uses physical forces (local point forces and global wind forces) as control signals for video generation models. Using only ~15K synthetic training videos (Blender flags and rolling balls) and a single day of training on 4×A100 GPUs, the method achieves remarkable generalization across diverse real-world scenes with varying objects, materials, and geometries, including preliminary mass understanding capabilities.
Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation: This paper proposes Foresight, a training-free adaptive layer reuse framework that establishes per-layer MSE thresholds during a warmup phase and dynamically decides at inference time whether to reuse cached features or recompute each layer. Evaluated on 5 video generation models, Foresight achieves superior quality and speed trade-offs compared to static methods, with up to 2.23× acceleration.
LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation: This paper proposes LeMiCa, a training-free acceleration framework for diffusion-based video generation that formulates cache scheduling as a lexicographic minimax path optimization problem on a directed acyclic graph (DAG), achieving simultaneous gains in speed and quality (2.9× speedup on Latte; LPIPS as low as 0.05 on Open-Sora) via global error control.
MagCache: Fast Video Generation with Magnitude-Aware Cache: This paper observes that the magnitude ratio of residual outputs between adjacent timesteps in video diffusion models decreases monotonically in a consistent way across models and prompts, a pattern it terms the "Unified Magnitude Law". Building on this, MagCache models skip-step error accumulation via magnitude ratios, adaptively skips redundant timesteps, and reuses cached outputs, requiring only a single calibration sample. It achieves 2.10–2.68× speedup on Open-Sora, CogVideoX, Wan 2.1, and HunyuanVideo while outperforming TeaCache and other existing methods on LPIPS, SSIM, and PSNR.
Photography Perspective Composition: Towards Aesthetic Perspective Recommendation: This paper proposes a novel "Photography Perspective Composition" (PPC) paradigm that goes beyond traditional cropping-based approaches. It constructs a perspective transformation dataset via 3D reconstruction, generates recommended viewpoints through Image-to-Video generation, aligns with human preferences via RLHF, and evaluates perspective quality using a PQA model.
PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation: PhysCtrl employs diffusion models to learn the physical dynamics distribution of four material types (elastic, sand, plasticine, and rigid bodies), representing dynamics as 3D point trajectories. A diffusion model incorporating spatiotemporal attention and physics constraints is trained on 550K synthetic animations; the generated trajectories drive a pretrained video model to achieve high-fidelity physics video generation controllable by force and material parameters.
PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis: This paper proposes PoseCrafter, a training-free framework for extreme pose estimation. It synthesizes high-fidelity intermediate frames via Hybrid Video Generation (HVG, a two-stage pipeline combining DynamiCrafter and ViewCrafter) to address pose estimation for image pairs with minimal or no overlap, and employs a Feature Matching Selector (FMS) to efficiently identify the most informative intermediate frames. The method achieves significant improvements in extreme pose estimation accuracy across four datasets.
Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation: Radial Attention identifies a "spatiotemporal energy decay" phenomenon in video diffusion models, wherein attention scores decay exponentially with spatiotemporal distance. Based on this finding, the authors design a static sparse attention mask with O(n log n) complexity, achieving up to 3.7× inference speedup on models such as HunyuanVideo and Wan2.1, and enabling 4× longer video generation via LoRA fine-tuning.
RLGF: Reinforcement Learning with Geometric Feedback for Autonomous Driving Video Generation: This paper is the first to systematically quantify geometric distortions in autonomous driving video generation. It proposes the RLGF framework, which leverages hierarchical geometric rewards (vanishing point → lane lines → depth → occupancy) combined with a latent-space sliding window optimization strategy to improve 3D object detection mAP from 25.75 to 31.42 (reported as a 12.7% gain), substantially closing the performance gap between synthetic and real data.
S²Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation: To address the high calibration variance and optimization difficulty caused by extremely long token sequences in video diffusion Transformers, this paper proposes the S²Q-VDiT framework. By combining Hessian-aware salient data selection and attention-guided sparse token distillation, it achieves lossless quantization under W4A6 settings for the first time, yielding 3.9× model compression and 1.3× inference speedup.
Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking: Safe-Sora is the first method to embed graphical watermarks (e.g., logo images) directly into the video generation pipeline. It employs a coarse-to-fine adaptive matching strategy to assign watermark patches to visually similar frames and regions, and designs a 3D wavelet transform-enhanced Mamba architecture for spatiotemporal fusion. The method substantially outperforms all baselines in both video quality (FVD 3.77 vs. the second-best 154.35) and watermark fidelity.
Scaling RL to Long Videos: This paper proposes LongVILA-R1, a full-stack framework that extends VLM reasoning to long videos (up to 8192 frames) via a 104K long-video reasoning dataset, a two-stage CoT-SFT + RL training pipeline, and the MR-SP multimodal reinforcement sequence parallelism system, achieving 65.1%/71.1% on VideoMME (without/with subtitles).
Seeing the Wind from a Falling Leaf: An end-to-end differentiable inverse graphics framework is proposed that jointly models object geometry/physical properties, force field representations, and physical processes to recover invisible force fields (e.g., wind fields) from video via backpropagation, while supporting physics-based video generation and editing.
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion: This paper proposes the Self Forcing training paradigm, which eliminates the exposure bias caused by train-inference distribution mismatch in Teacher Forcing and Diffusion Forcing by performing autoregressive self-rollout during training and applying a holistic video-level distribution matching loss (DMD/SiD/GAN). Built on Wan2.1-1.3B, it achieves real-time streaming video generation at 17 FPS on a single GPU while matching or surpassing the quality of bidirectional diffusion models that are orders of magnitude slower.
Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation: This paper proposes SCINE (Stable Cinemetrics), the first structured evaluation framework targeting professional video production. It defines a hierarchical taxonomy with 76 fine-grained cinematic control nodes, accompanied by large-scale professional annotation (80+ film practitioners, 20K+ videos, 248K annotations), revealing significant deficiencies of current state-of-the-art T2V models in professional cinematic control.
Training-Free Efficient Video Generation via Dynamic Token Carving: This paper proposes Jenga, a training-free inference acceleration framework for video DiTs that achieves 8.83× speedup on HunyuanVideo with only a 0.01% drop in VBench score. The framework combines dynamic block attention carving (sparse KV block selection after token reordering via 3D space-filling curves) and a progressive resolution strategy (coarse-to-fine denoising), which operate orthogonally.
Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision: This paper reveals that pretrained video diffusion models naturally learn motion representations suitable for tracking during high-noise denoising stages, and proposes the TED framework that fuses motion and appearance features, achieving up to 10 percentage points improvement over existing self-supervised methods on tracking similar-looking objects.
Video Killed the Energy Budget: Characterizing the Latency and Power Regimes of Open Text-to-Video Models: This paper presents a systematic analysis of latency and energy consumption for open-source text-to-video (T2V) models. It establishes a FLOP-based analytical model to predict scaling laws for Wan2.1 (quadratic scaling along spatial/temporal dimensions and linear scaling with respect to denoising steps) and provides a cross-model energy benchmark across 7 T2V models.
VMDT: Decoding the Trustworthiness of Video Foundation Models: This paper introduces VMDT (Video-Modal DecodingTrust), the first unified benchmark platform for evaluating the trustworthiness of T2V and V2T video foundation models across five dimensions—safety, hallucination, fairness, privacy, and adversarial robustness—covering large-scale assessments of 7 T2V and 19 V2T models, and revealing the complex relationship between model scale and trustworthiness.
VORTA: Efficient Video Diffusion via Routing Sparse Attention: This paper proposes VORTA, a framework that achieves end-to-end 1.76× acceleration of video diffusion Transformers without quality degradation, through bucketed coreset attention (for modeling long-range dependencies) and a signal-aware routing mechanism (for adaptively selecting sparse attention branches). Combined with caching and distillation methods, it achieves up to 14.41× acceleration.
VSA: Faster Video Diffusion with Trainable Sparse Attention: This paper proposes VSA (Video Sparse Attention), an end-to-end trainable, hardware-aligned sparse attention mechanism with a hierarchical coarse-fine design: a coarse-grained stage predicts key token positions via cube pooling, and a fine-grained stage performs token-level attention within the predicted block-sparse regions. VSA accelerates both training and inference of video DiTs simultaneously: pretraining from scratch achieves a 2.53× reduction in training FLOPs without quality loss, while adapting to Wan2.1-1.3B yields a 6× attention speedup and reduces end-to-end inference time from 31s to 18s.
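
Several entries above (Foresight, LeMiCa, MagCache) decide at inference time when a cached output can be reused instead of recomputed. The following is a minimal sketch of that kind of decision rule, loosely following the MagCache description of accumulating skip-step error from magnitude ratios. The specific error formula, the threshold, the cap on consecutive skips, and the name `plan_skips` are assumptions for illustration and are not taken from any of the papers.

```python
def plan_skips(mag_ratios, err_threshold=0.1, max_consecutive=2):
    """Return one boolean per timestep: True means reuse the cached output.

    mag_ratios[i] is a (hypothetical) calibrated magnitude ratio between the
    residual at step i and the residual at step i - 1.
    """
    skip = []
    acc_ratio = 1.0  # product of ratios since the last full computation
    acc_err = 0.0    # accumulated estimate of the error from skipping
    run = 0          # number of consecutive skipped steps
    for i, ratio in enumerate(mag_ratios):
        cand_ratio = acc_ratio * ratio
        cand_err = acc_err + abs(1.0 - cand_ratio)
        can_skip = i > 0 and cand_err <= err_threshold and run < max_consecutive
        if can_skip:
            skip.append(True)
            acc_ratio, acc_err, run = cand_ratio, cand_err, run + 1
        else:
            # Full computation refreshes the cache and resets the error budget.
            skip.append(False)
            acc_ratio, acc_err, run = 1.0, 0.0, 0
    return skip


schedule = plan_skips([1.0, 0.99, 0.98, 0.97, 0.5])
print(schedule)
```

In this toy schedule the first step is always computed, since there is nothing cached yet, and a sharp change in the ratio forces a recomputation regardless of the remaining budget.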

💻 Code Intelligence¶

A Self-Improving Coding Agent: This paper proposes SICA (Self-Improving Coding Agent), a coding agent capable of autonomously editing its own codebase to improve performance. By eliminating the distinction between meta-agent and target-agent, SICA achieves iterative self-improvement, advancing from 17% to 53% on a subset of SWE-Bench Verified.
A Stochastic Differential Equation Framework for Multi-Objective LLM Interactions: This paper models multi-objective optimization in iterative LLM interactions as an SDE (drift-diffusion process), quantifies inter-objective coupling via an interference matrix, and analyzes strategy convergence behavior through eigenvalue spectral analysis. Validation on code generation (three objectives: security, efficiency, functionality) demonstrates convergence rates ranging from 0.33 to 1.29 and predictability up to $R^2 = 0.74$ across different strategies.
AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy: AstroVisBench is the first code benchmark for evaluating LLMs on astronomical scientific computing and visualization. It extracts 864 tasks (processing + visualization) from 110 Jupyter Notebooks, and designs a dual evaluation pipeline (execution-based variable inspection + VLM-as-Judge visualization scoring, achieving Spearman ρ=0.822 with expert ratings). Evaluation of 8 state-of-the-art models reveals that Gemini 2.5 Pro performs best, yet attains only a 15.7% error-free rate, with FileNotFoundError accounting for 43% of all errors.
VeriMaAS: Automated Multi-Agent Workflows for RTL Design: VeriMaAS is a framework for automatically composing multi-agent workflows for RTL code generation. Its core innovation is the direct integration of formal verification feedback from HDL tools (Yosys synthesis + OpenSTA timing analysis) into workflow orchestration, achieving a 2–12% pass@1 improvement on VeriThoughts while requiring only a few hundred samples for controller tuning, an order of magnitude fewer than full fine-tuning.
Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning: This paper proposes CURE, a framework in which a single LLM simultaneously assumes the roles of code generator and unit test generator. Cross-execution between generated code and generated tests constructs a pairwise reward matrix; theoretically derived reward signals then drive reinforcement learning. Without any ground-truth code annotations, CURE achieves co-evolution of both code generation and unit test generation capabilities, substantially outperforming dedicated coder models of comparable scale across five programming benchmarks.
CoRe: Benchmarking LLMs' Code Reasoning Capabilities through Static Analysis Tasks: This paper introduces CoRe, a high-quality benchmark comprising 12,553 manually validated task instances. Through three categories of fundamental static analysis tasks—data dependency, control dependency, and information flow—CoRe directly evaluates the code semantic reasoning capabilities of LLMs, revealing that current models remain severely deficient on tasks requiring multi-step reasoning, such as trace generation and source enumeration.
Embedding Alignment in Code Generation for Audio: A dual-MLP + InfoNCE contrastive learning framework is proposed to align code embeddings (distilroberta-base) and audio embeddings (wav2vec2) into a shared space, enabling LLM-based code generation pipelines to infer musical similarity directly from code without compilation or execution. CKA improves from 0.090 to 0.590.
FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts: Inspired by the fly olfactory circuit, FlyLoRA replaces the down-projection matrix $A$ in LoRA with a frozen sparse random projection and employs top-$k$ activation selection to realize implicit rank-wise MoE routing. This design eliminates routing parameters, reduces intra-task interference, and naturally supports multi-task model merging by exploiting the near-orthogonality of random projections.
FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis: This paper introduces FractalBench, a benchmark for diagnosing visual-mathematical reasoning in MLLMs via fractal image program synthesis. Comprising 12 classical fractals, 610 test images, and evaluations across 4 MLLMs, it reveals that while 76% of generated code is executable, only 4% is visually correct, exposing fundamental deficiencies in recursive abstraction capabilities.
Learning From Design Procedure To Generate CAD Programs for Data Augmentation: This paper proposes a CAD program data augmentation paradigm inspired by industrial design workflows. By providing reference surface programs and design procedure descriptions as LLM prompts, the method guides the generation of CAD programs containing B-Spline organic shapes, substantially narrowing the geometric complexity gap between public CAD datasets and industrial-grade designs.
Learning to Solve Complex Problems via Dataset Decomposition: This paper proposes Decomp, a method that employs a teacher model (GPT-4o) to recursively decompose complex math problems into simpler subproblems along reasoning steps, constructs a concept dependency graph to quantify difficulty, and trains student models following an easy-to-hard curriculum. Qwen2.5-1.5B achieves 51.6% on MATH-500 (surpassing MuggleMath's 50.4% with 147K samples), while Qwen3-4B reaches 16.7% on AIME2025 using only 385 samples (surpassing Qwen2.5-72B's 15.0%).
MaintainCoder: Maintainable Code Generation Under Dynamic Requirements: This work is the first to systematically define and address the maintainability problem in LLM-based code generation, contributing both a benchmark and a method. MaintainBench evaluates code maintainability under requirement evolution using 4 change patterns and dynamic metrics; MaintainCoder integrates the Waterfall model, design patterns, and 6 specialized agents, achieving 60%+ improvement on dynamic maintainability metrics while also improving initial code correctness.
MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research: This paper proposes MLR-Bench, a comprehensive benchmark comprising 201 open-ended ML research tasks, accompanied by MLR-Judge (an LLM-based evaluation framework) and MLR-Agent (a modular research agent). The study finds that state-of-the-art coding agents fabricate or fail to verify experimental results in approximately 80% of cases, exposing a fundamental bottleneck in AI-automated scientific research.
Once Upon an Input: Reasoning via Per-Instance Program Synthesis: This paper proposes PIPS (Per-Instance Program Synthesis), which iteratively refines programs through instance-level program synthesis and structured feedback, while dynamically selecting between direct reasoning and program synthesis via a confidence measure. PIPS achieves an 8.6% improvement in harmonic mean accuracy across 30 benchmarks.
Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization: This paper systematically investigates how compositional properties of calibration data (sequence length, sample size, source, format) and domain correspondence affect capability preservation after LLM compression. It finds that representativeness and diversity in the activation space are the fundamental determinants of calibration data quality, and proposes a three-stage calibration data curation framework, COLA.
Principled Fine-tuning of LLMs from User-Edits: A Medley of Preference, Supervision, and Reward: This paper systematically investigates how to fine-tune LLMs using user-edit data, unifying three feedback types—preference, supervision, and cost—and proposes a simple ensembling procedure that achieves robust adaptation across diverse user distributions.
Program Synthesis via Test-Time Transduction: This paper proposes SYNTRA, a framework that reframes program synthesis as transductive learning — at test time, it leverages visible test inputs and LLM judgment to iteratively eliminate inconsistent candidate program hypotheses. A greedy maximin algorithm minimizes the number of LLM queries, achieving accuracy improvements of up to 196% across 4 benchmarks.
QiMeng-SALV: Signal-Aware Learning for Verilog Code Generation: This paper proposes QiMeng-SALV, a signal-aware learning method that extracts functionally correct signal-level code snippets from partially incorrect Verilog modules as reward signals for DPO training, elevating the optimization granularity from module level to signal level and achieving SOTA on VerilogEval and RTLLM.
Searching Latent Program Spaces: This paper proposes the Latent Program Network (LPN), which uses an encoder to map input–output examples into a latent program representation, then performs gradient-based search in the latent space at test time to adapt to new tasks. LPN substantially outperforms in-context learning and test-time training methods on the ARC-AGI benchmark.
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents: A fully automated pipeline is developed to continuously mine real-world software engineering interaction tasks from GitHub, producing the SWE-rebench dataset of 21,000+ executable Python tasks and a decontaminated benchmark. The work reveals that several models exhibit contamination-inflated performance on SWE-bench Verified (e.g., DeepSeek-V3: 39.7% on SWE-bench vs. 21.3% on SWE-rebench).
Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models: This paper proposes VSGRPO — a dual-reward reinforcement learning strategy based on GRPO — that jointly optimizes a structure-level reward (TEDS-Structure) and a visual fidelity reward (CW-SSIM on rendered images). The fine-tuned MLLM (only 3B parameters) surpasses GPT-4o and models with 72B+ parameters on the table-image-to-LaTeX generation task, with particularly significant gains on complex tables.
Text-to-Code Generation for Modular Building Layouts in Building Information Modeling: This paper proposes Text2MBL, a framework that translates natural language descriptions into executable BIM code (rather than coordinate sequences). Through an object-oriented code architecture and LLM fine-tuning, it enables automatic generation of modular building layouts, achieving 10%+ IoU improvement in geometric consistency over coordinate-driven methods.
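
The Program Synthesis via Test-Time Transduction entry above describes eliminating inconsistent candidate programs with a greedy maximin choice of which test input to query. A minimal sketch of that loop is given below, under stated assumptions: candidates are plain Python callables, the `oracle` argument is a hypothetical stand-in for the LLM judgment, and the names `pick_query` and `transduce` are invented here rather than taken from SYNTRA.

```python
from collections import defaultdict


def pick_query(candidates, inputs):
    """Pick the input whose worst-case answer eliminates the most candidates."""
    best_input, best_worst = None, -1
    for x in inputs:
        groups = defaultdict(int)
        for f in candidates:
            groups[f(x)] += 1
        # If the oracle picks the most popular output, this many are removed.
        worst_case_eliminated = len(candidates) - max(groups.values())
        if worst_case_eliminated > best_worst:
            best_input, best_worst = x, worst_case_eliminated
    return best_input


def transduce(candidates, inputs, oracle):
    """Eliminate candidates inconsistent with oracle answers; count queries."""
    remaining, pending, queries = list(candidates), list(inputs), 0
    while len(remaining) > 1 and pending:
        x = pick_query(remaining, pending)
        pending.remove(x)
        outputs = {f(x) for f in remaining}
        if len(outputs) == 1:
            break  # no remaining input separates the candidates
        y = oracle(x, outputs)
        queries += 1
        remaining = [f for f in remaining if f(x) == y]
    return remaining, queries


# Toy hypotheses that all agree on the input 2 but differ elsewhere.
hypotheses = [lambda x: x * 2, lambda x: x + 2, lambda x: x ** 2]
survivors, n_queries = transduce(hypotheses, [2, 3, 0], lambda x, outs: x * 2)
print(len(survivors), n_queries, survivors[0](10))
```

Here the input 3 is queried first because every candidate gives a different answer on it, so a single oracle call is enough to isolate one hypothesis.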

🔗 Causal Inference¶

A Principle of Targeted Intervention for Multi-Agent Reinforcement Learning: This paper proposes a Targeted Intervention paradigm grounded in Multi-Agent Influence Diagrams (MAIDs), which applies Pre-Strategy Intervention (PSI) exclusively to a single target agent to guide the entire multi-agent system toward a preferred Nash equilibrium satisfying additional desired outcomes, without requiring global intervention over all agents.
An Analysis of Causal Effect Estimation Using Outcome Invariant Data Augmentation: This paper presents the first systematic analysis of outcome invariant data augmentation (DA) for causal effect estimation. It proves that when DA operations preserve the outcome variable, they are equivalent to soft interventions on the treatment variable, thereby reducing confounding bias. The paper further proposes an IV-like (IVL) regression framework that treats DA parameters as "instrument-like" variables, and reduces bias further through adversarial DA composition.
Bi-Level Decision-Focused Causal Learning for Large-Scale Marketing Optimization: This paper proposes Bi-DFCL, a bilevel optimization framework that jointly leverages observational (OBS) data and randomized controlled trial (RCT) data to train marketing resource allocation models. The upper level trains a Bridge Network with unbiased decision loss on RCT data to dynamically correct the bias of the lower level trained on OBS data. The framework further introduces differentiable surrogate decision losses (PPL/PIFD) grounded in the primal problem and an implicit differentiation algorithm, addressing the predict-then-optimize inconsistency and the bias-variance dilemma of conventional two-stage methods. The system has been deployed at scale on Meituan.
Causality-Induced Positional Encoding for Transformer-Based Representation Learning of Non-Sequential Features: CAPE learns the causal DAG structure among features from tabular data and embeds it into hyperbolic space to generate causality-aware rotary positional encodings (RoPE). This enables Transformers to process non-sequential yet causally structured feature data, with significant performance gains on downstream multi-omics tasks.
Conformal Prediction for Causal Effects of Continuous Treatments: This work is the first to construct conformal prediction intervals for causal effects of continuous treatments (e.g., drug dosage) by parameterizing intervention-induced propensity shifts via a tilting function family and employing quantile regression, providing finite-sample $1-\alpha$ coverage guarantees under both known and unknown propensity score settings.
Counterfactual Reasoning for Steerable Pluralistic Value Alignment of Large Language Models: This paper proposes the COUPLE framework, which constructs a Structural Causal Model (SCM) to model the dependencies and priorities among multi-dimensional values, and leverages counterfactual reasoning to achieve steerable alignment of LLMs toward arbitrary fine-grained pluralistic value objectives.
Cyclic Counterfactuals under Shift–Scale Interventions: This paper establishes a theoretical framework for counterfactual reasoning under shift–scale soft interventions in cyclic (non-DAG) structural causal models (SCMs). It proves that a global contraction condition guarantees unique solvability of cyclic SCMs and derives sub-Gaussian concentration inequalities for counterfactual distributions.
Demystifying Spectral Feature Learning for Instrumental Variable Regression: This paper establishes rigorous generalization error bounds for spectral feature-based nonparametric instrumental variable (NPIV) regression, revealing that performance is jointly governed by two factors: spectral alignment between the structural function and the conditional expectation operator (approximation error) and the rate of singular value decay (estimation error). A Good-Bad-Ugly trichotomy is proposed along with data-driven diagnostic tools.
Differentiable Structure Learning and Causal Discovery for General Binary Data: This paper proposes a general differentiable structure learning framework based on the Multivariate Bernoulli Distribution (MVB) that makes no assumptions about the specific data-generating process, captures arbitrary higher-order dependencies among binary discrete variables, and proves that while DAGs are not identifiable in the general setting, the minimal equivalence class (Markov equivalence class) is recoverable.
Do-PFN: In-Context Learning for Causal Effect Estimation: This paper proposes Do-PFN, which extends Prior-data Fitted Networks (PFN) to causal effect estimation. A Transformer is pre-trained on large-scale synthetic SCM data to perform in-context causal reasoning, enabling prediction of causal intervention distributions (CID) and CATE from observational data alone—without requiring causal graph knowledge or the unconfoundedness assumption—achieving strong performance on both synthetic and semi-synthetic benchmarks.
Domain-Adapted Granger Causality for Real-Time Cross-Slice Attack Attribution in 6G Networks: This paper proposes a domain-adapted Granger causality framework for 6G network slicing that integrates enhanced Granger causality testing with network resource contention modeling to enable real-time cross-slice attack attribution, achieving 89.2% accuracy and 87 ms response time across 1,100 attack scenarios, substantially outperforming existing statistical, deep learning, and causal discovery methods.
Few-Shot Knowledge Distillation of LLMs With Counterfactual Explanations: This paper proposes CoD (Counterfactual-explanation-infused Distillation), which injects counterfactual explanations into few-shot training sets to precisely map the teacher's decision boundary, achieving significant improvements over standard distillation methods across 6 datasets using only 8–512 samples.
From Black-box to Causal-box: Towards Building More Interpretable Models: This paper proposes a formal definition of causal interpretability, proves that both black-box models and concept bottleneck models fail to satisfy this property, establishes a complete graphical criterion for identifying which model architectures can consistently answer counterfactual queries, and reveals a fundamental tradeoff between causal interpretability and predictive accuracy.
GST-UNet: A Neural Framework for Spatiotemporal Causal Inference with Time-Varying Confounding: This paper proposes GST-UNet, which integrates a U-Net spatiotemporal encoder with iterative G-computation to estimate location-specific conditional average potential outcomes (CAPOs) from a single spatiotemporal observational trajectory. The framework simultaneously handles interference, spatial confounding, temporal carry-over effects, and time-varying confounding, and is validated on a real-world causal analysis of wildfire smoke effects on respiratory hospitalization rates in California.
It's Hard to Be Normal: The Impact of Noise on Structure-agnostic Estimation: This paper proves that Double Machine Learning (DML) is minimax optimal under Gaussian treatment noise ($O(\epsilon^2 + n^{-1/2})$), but becomes suboptimal under non-Gaussian noise. It proposes Agnostic Cumulant-based Estimation (ACE), which exploits higher-order cumulants to achieve $r$-th order insensitivity $O(\epsilon^r + n^{-1/2})$.
LLM Interpretability with Identifiable Temporal-Instantaneous Representation: This paper proposes an identifiable temporal causal representation learning framework for the high-dimensional activation spaces of LLMs. By adopting a linearized formulation that jointly models time-lagged and instantaneous causal relationships, it resolves the computational bottleneck that prevents existing CRL methods from scaling to LLM-scale dimensions, while preserving theoretical identifiability guarantees.
Performative Validity of Recourse Explanations: This paper formally analyzes the "performative" effects of recourse explanations — when a large number of rejected applicants act on recourse recommendations, their collective behavior induces distribution shift that renders recourse invalid after model retraining — and proves that only Improvement-based Causal Recourse (ICR), which intervenes solely on causal variables, preserves "performative validity" under broad conditions.
Practical do-Shapley Explanations with Estimand-Agnostic Causal Inference: This paper proposes the Estimand-Agnostic (EA) approach and the Frontier-Reducibility Algorithm (FRA) for efficient computation of causal Shapley values (do-SV). By training a single SCM to learn the observational distribution, the framework answers arbitrary identifiable causal queries and reduces the number of coalitions requiring evaluation by approximately 90%.
Revealing Multimodal Causality with Large Language Models: This paper proposes MLLM-CD, the first framework for causal discovery from multimodal unstructured data (text + images). It identifies causal variables via contrastive factor discovery, infers causal structure through statistical methods, and resolves structural ambiguity via iterative multimodal counterfactual reasoning.
Root Cause Analysis of Outliers with Missing Structural Knowledge: Two simple and efficient algorithms are proposed for root cause analysis using only marginal anomaly scores: SMOOTH TRAVERSAL (known causal graph — finds the node with the largest score jump along causal paths) and SCORE ORDERING (unknown causal graph — ranks nodes by score and returns the top-$k$). Both algorithms provide nonparametric probabilistic guarantees under polytree structure and operate on a single anomalous sample.
Transferring Causal Effects using Proxies: This paper proposes a multi-domain causal effect transfer method based on proxy variables. Given that only proxy variable $W$ is observed in the target domain, the method leverages multi-source domain data to identify and estimate the interventional distribution under unobserved confounders in the target domain, and provides two consistent estimators with asymptotic confidence intervals.
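
The SCORE ORDERING step described in the Root Cause Analysis entry above (rank nodes by marginal anomaly score, return the top-$k$) can be sketched in a few lines. This is only an illustration of the ranking step as summarized in the note: the node names, the score values, and the `score_ordering` helper are hypothetical, and the paper's actual anomaly-score definition and guarantees are not reproduced here.

```python
from typing import Dict, List


def score_ordering(scores: Dict[str, float], k: int = 1) -> List[str]:
    """Return the k nodes with the highest marginal anomaly scores.

    `scores` maps a node name to its anomaly score for ONE anomalous sample.
    No causal graph is used, matching the unknown-graph setting in the note.
    """
    # Sort by descending score; break ties by node name so output is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical anomaly scores for a single anomalous sample.
example_scores = {
    "db_latency": 7.2,
    "api_errors": 3.1,
    "cache_miss": 5.4,
    "cpu_load": 0.8,
}
top2 = score_ordering(example_scores, k=2)
print(top2)
```

The returned list is a set of root-cause candidates; how the scores are computed determines whether the candidates are meaningful.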

⚛️ Physics¶

AstroCo: Self-Supervised Conformer-Style Transformers for Light-Curve Embeddings: This paper proposes AstroCo, a self-supervised encoder that introduces the Conformer architecture (attention + depthwise separable convolution + gating) for irregular astronomical light curves. On the MACHO dataset, AstroCo reduces reconstruction error by 61–70% compared to Astromer v1/v2 and improves few-shot classification macro-F1 by approximately 7%.
Dynamic Diffusion Schrödinger Bridge in Astrophysical Observational Inversions: This paper proposes Astro-DSB, a Diffusion Schrödinger Bridge-based framework for modeling astrophysical inverse problems. It directly learns a probabilistic mapping from observables to true physical distributions, requires only 25% of the training cost of conditional DDPM, demonstrates significant generalization advantages in out-of-distribution (OOD) evaluation, and is successfully applied to real observational data from Taurus B213.
Exoplanet Formation Inference Using Conditional Invertible Neural Networks: A conditional invertible neural network (cINN) trained on 15,777 synthetic planets infers planet formation parameters (disk mass, turbulent $\alpha$, dust-to-gas ratio) from observables (planet mass, orbital distance), achieving probabilistic parameter retrieval ~10⁶× faster than physical simulations. Multi-planet system data is shown to yield more robust inference than single-planet data.
FAIR Universe HiggsML Uncertainty Dataset and Competition: This work provides a standardized dataset of 280 million simulated LHC collision events and a competition platform featuring six parameterized systematic biases (detector calibration + background composition) alongside an asymmetric coverage penalty metric. Participants are required to construct robust 68.27% confidence intervals for the Higgs signal strength $\mu$. The winning solutions, based on profile-free surrogate modeling, achieve confidence intervals approximately 20% narrower than conventional binned methods.
FEAT: Free Energy Estimators with Adaptive Transport: This paper proposes the FEAT framework, which employs stochastic interpolants to learn transport maps between two thermodynamic systems. Building on the escorted Jarzynski equality and the controlled Crooks theorem, FEAT provides consistent, minimum-variance free energy difference estimators along with variational upper and lower bounds, thereby unifying equilibrium and non-equilibrium approaches.
From Simulations to Surveys: Domain Adaptation for Galaxy Observations: This work constructs a domain adaptation pipeline from simulated galaxies (TNG50) to real survey observations (SDSS) via feature-level alignment using Euclidean distance, optimal transport, and a top-$k$ soft-matching loss with trainable weight scheduling, improving target-domain morphology classification accuracy from 46.8% (no adaptation) to 87.3%, and Macro F1 from 0.298 to 0.626.
Knowledge is Overrated: A Zero-Knowledge ML and Cryptographic Hashing-Based Framework for Verifiable, Low Latency Inference at the LHC: This paper proposes PHAZE, a framework that combines cryptographic hashing (Rabin fingerprinting) and zero-knowledge machine learning (zkML) to enable verifiable early-exit inference at LHC trigger latency, achieving a theoretical online latency of ~152–253 ns while providing built-in anomaly detection capability.
Latent Representation Learning in Heavy-Ion Collisions with MaskPoint Transformer: This work introduces a masked point cloud Transformer autoencoder to heavy-ion collision analysis. Through a two-stage paradigm of self-supervised pre-training followed by supervised fine-tuning, the model learns nonlinear latent representations substantially stronger than those of PointNet—reducing PC1 distribution overlap from 2.42% to 0.27%—providing a general feature learning framework for studying QGP properties.
Multi-Modal Masked Autoencoders for Learning Image-Spectrum Associations for Galaxy Evolution and Cosmology: This paper applies a Multimodal Masked Autoencoder (MMAE) to jointly model galaxy images (HSC-PDR2, five bands) and spectra (DESI-DR1), constructing a cross-modal dataset GalaxiesML-Spectra of 134,533 galaxies. Under a 75% masking ratio, the model reconstructs major spectral emission lines and image morphology. When spectra are entirely absent at inference, the model achieves $\sigma_{\text{NMAD}}=0.016$ for redshift prediction using images alone, outperforming AstroCLIP while extending the redshift range to $z \sim 4$ for the first time.
Neural Deprojection of Galaxy Stellar Mass Profiles: A neural network approach is proposed to map Nuker galaxy profile parameters to analytically deprojectable Multi-Gaussian Expansion (MGE) components, enabling stellar mass modeling of galaxies without optical imaging. The method is integrated into the differentiable dynamical modeling pipeline SuperMAGE for Bayesian inference of supermassive black hole (SMBH) masses.
POLARIS: A High-contrast Polarimetric Imaging Benchmark Dataset for Exoplanetary Disk Representation Learning: This work introduces POLARIS, the first ML benchmark dataset for exoplanetary polarimetric imaging (921 VLT/SPHERE/IRDIS polarimetric images + 75,910 preprocessed exposures), and proposes the Diff-SimCLR framework (diffusion-augmented contrastive learning), achieving 93% accuracy on the reference-star vs. target-star classification task with fewer than 10% manual annotations.
Quantum Doubly Stochastic Transformers: This paper proposes QDSFormer (Quantum Doubly Stochastic Transformer), replacing softmax with a variational quantum circuit QontOT to generate doubly stochastic attention matrices. Both theoretical analysis and experiments demonstrate that quantum-circuit-generated DSMs are more diverse and better at preserving information, consistently outperforming standard ViT and Sinkformer on multiple small-scale visual recognition tasks.
Simulation-Based Inference for Neutrino Interaction Model Parameter Tuning: This work presents the first application of simulation-based inference (SBI) to neutrino interaction model parameter tuning. Using neural posterior estimation (NPE), the method learns the posterior distribution of 4 physical parameters from 200K GENIE-simulated 58-bin histograms, and accurately recovers the ground-truth parameter values on mock data from the MicroBooNE Tune.
The Pareto Frontier of Resilient Jet Tagging: This work systematically evaluates the AUC–resilience trade-off across multiple architectures (DNN/PFN/EFN/ParT) for LHC jet tagging tasks, revealing that more complex models achieve higher AUC but exhibit stronger Monte Carlo model dependence. A Pareto frontier is constructed, and a case study demonstrates that low-resilience classifiers introduce bias in downstream parameter estimation even after calibration.
The Platonic Universe: Do Foundation Models See the Same Sky?: This paper validates the Platonic Representation Hypothesis (PRH) in an astronomical setting. Using JWST, HSC, Legacy Survey, and DESI spectroscopic data, it measures representation alignment across six foundation models (ViT/ConvNeXt/DINOv2/IJEPA/AstroPT/Specformer) and finds that both intra-modal and cross-modal MKNN scores consistently increase with model scale ($p = 3.31 \times 10^{-5}$), supporting the hypothesis that models of different architectures and modalities converge toward a shared representation.
TITAN: A Trajectory-Informed Technique for Adaptive Parameter Freezing in Large-Scale VQE: This paper proposes TITAN, a framework that employs deep learning models to predict "frozen parameters" in VQE—parameters that remain inactive throughout training—enabling 40–60% of parameters to be frozen at initialization, achieving up to 3× convergence speedup and 40–60% reduction in circuit evaluations, while matching or surpassing baseline accuracy on molecular systems of up to 30 qubits.
Toward Complete Merger Identification at Cosmic Noon with Deep Learning: A ResNet18 is trained on simulated HST CANDELS images generated from IllustrisTNG50, demonstrating for the first time that deep learning can successfully identify galaxy mergers at high redshift $1<z<1.5$, including minor mergers ($\mu \geq 1/10$) and low-mass galaxies ($M_\star > 10^8 M_\odot$), achieving an overall accuracy of ~73%. Model behavior is further analyzed through Grad-CAM and UMAP.
Transfer Learning Beyond the Standard Model: This work investigates whether neural networks pre-trained on the standard cosmological model (ΛCDM) can transfer to beyond-standard-model scenarios (massive neutrinos, modified gravity, primordial non-Gaussianity). The study finds that a dummy node architecture can reduce simulation requirements by an order of magnitude, but negative transfer emerges when parameters exhibit strong physical degeneracies (e.g., $\sigma_8$–$M_\nu$).
Unsupervised Discovery of High-Redshift Galaxy Populations with Variational Autoencoders: A variational autoencoder (VAE) is applied to unsupervised clustering of 2,743 JWST high-redshift ($z>4$) galaxy spectra, uncovering 12 distinct astrophysical categories and more than doubling the known sample sizes of rare populations including post-starburst galaxies, Lyman-α emitters, extreme emission line galaxies, and Little Red Dots.
Vision Transformers for Cosmological Fields: Application to Weak Lensing Mass Maps: This work presents the first systematic application of Vision Transformers (ViT and Swin Transformer) to constraining cosmological parameters ($\Omega_m$ and $S_8$) from weak lensing convergence maps, comparing attention-based architectures against CNNs within a simulation-based inference framework.
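
For readers unfamiliar with the $\sigma_{\text{NMAD}}$ figure quoted in the Multi-Modal Masked Autoencoders entry above, the sketch below uses the convention common in photometric-redshift work: $\sigma_{\text{NMAD}} = 1.4826 \times \mathrm{median}(|\Delta z - \mathrm{median}(\Delta z)|)$ with $\Delta z = (z_{\text{pred}} - z_{\text{true}})/(1 + z_{\text{true}})$. Whether the paper uses exactly this variant is an assumption, and the sample values are made up for illustration.

```python
from statistics import median


def sigma_nmad(z_pred, z_true):
    """Normalized median absolute deviation of redshift residuals.

    Assumes the common convention dz = (z_pred - z_true) / (1 + z_true);
    the factor 1.4826 makes the statistic match the standard deviation
    for Gaussian residuals.
    """
    dz = [(p - t) / (1.0 + t) for p, t in zip(z_pred, z_true)]
    center = median(dz)
    return 1.4826 * median(abs(d - center) for d in dz)


# Made-up redshifts, for illustration only.
z_true = [0.0, 1.0, 3.0]
z_pred = [0.1, 1.0, 2.6]
value = sigma_nmad(z_pred, z_true)
print(round(value, 5))
```

Because it is built on medians, the statistic is insensitive to a small fraction of catastrophic outliers, which is why it is usually reported alongside an outlier rate.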

🧑 Human Understanding¶

BEDLAM2.0: Synthetic Humans and Cameras in Motion: BEDLAM2.0 is a comprehensive upgrade over BEDLAM, introducing diverse camera motions (synthetic translation/tracking/orbit + handheld/head-mounted capture), broader body shape coverage (BMI 18–41), strand-based hair, shoes, size-graded clothing, and more 3D environments. The resulting dataset comprises 27K+ sequences and 8M+ frames; models trained exclusively on this synthetic data surpass the state of the art in world-coordinate human motion estimation.
ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts: This paper proposes ConceptScope, a framework that trains sparse autoencoders (SAE) on representations from visual foundation models to automatically discover and quantify visual concept biases in datasets, categorizing concepts into target / context / bias without any manual annotation.
CPEP: Contrastive Pose-EMG Pre-training Enhances Gesture Generalization on EMG Signals: This paper proposes the CPEP framework, which employs contrastive learning to align low-quality EMG signal representations with high-quality hand pose representations, endowing the EMG encoder with pose-awareness. CPEP is the first to achieve zero-shot recognition of unseen gestures from EMG signals, yielding a 21% improvement on in-distribution gesture classification and a 72% improvement on unseen gesture classification.
Cycle-Sync: Robust Global Camera Pose Estimation through Enhanced Cycle-Consistent Synchronization: Cycle-Sync is a global camera pose estimation framework that extends Message Passing Least Squares (MPLS) to camera position estimation, introduces a Welsch-type robust loss and cycle-consistency weighting, and surpasses all baselines—including complete SfM pipelines with bundle adjustment (BA)—without requiring BA.
DevFD: Developmental Face Forgery Detection by Learning Shared and Orthogonal LoRA Subspaces: This paper proposes DevFD—a developmental MoE architecture that models the common characteristics of real faces via a shared Real-LoRA, incrementally captures new forgery types via a sequence of orthogonal Fake-LoRAs, and mitigates catastrophic forgetting by integrating orthogonal gradient constraints into an orthogonal loss. DevFD achieves state-of-the-art accuracy and the lowest forgetting rate in continual learning for face forgery detection.
Foundation Cures Personalization: Improving Personalized Models' Prompt Consistency via Hidden Foundation Knowledge: FreeCure reveals that identity embeddings in face personalization models suppress but do not destroy the prompt control capability of the foundation model. Based on this insight, the paper proposes a training-free framework that injects attribute information from the foundation model into the personalized generation process via Foundation-Aware Self-Attention (FASA). The method substantially improves prompt consistency while preserving identity fidelity, and can be seamlessly integrated into mainstream architectures including SD, SDXL, and FLUX.
HOI-Dyn: Learning Interaction Dynamics for Human-Object Motion Diffusion: This paper models human-object interaction (HOI) generation as a Driver-Responder system, employing a lightweight Transformer-based interaction dynamics model to explicitly predict how objects respond to human actions. A residual dynamics loss is introduced during training to enforce causal consistency, while inference efficiency is preserved.
K-DeCore: Facilitating Knowledge Transfer in Continual Structured Knowledge Reasoning: This paper proposes K-DeCore, a framework that decouples structured knowledge reasoning into two stages — task-agnostic schema filtering and task-specific query construction — and combines dual-perspective memory construction with structure-guided pseudo-data synthesis to enable effective knowledge transfer across heterogeneous SKR tasks under a fixed parameter budget.
Mechanistic Interpretability of RNNs Emulating Hidden Markov Models: A vanilla RNN is trained to reproduce the emission statistics of a Hidden Markov Model (HMM), and its internal mechanisms are reverse-engineered to reveal that the network implements discrete stochastic state transitions via noise-sustained orbital dynamics, "kick neuron" circuits, and self-induced stochastic resonance.
MOSPA: Human Motion Generation Driven by Spatial Audio: This work introduces the novel task of spatial-audio-driven human motion generation, constructs the SAM dataset comprising 9+ hours, 27 scenes, and 12 subjects with paired binaural audio and motion data, and proposes the MOSPA diffusion model. By fusing audio features including MFCC, tempogram, and RMS with sound source position and motion style conditions, MOSPA achieves an FID of 7.98, substantially outperforming music/dance baselines such as EDGE (14.0) and POPDG (21.0).
OmniGaze: Reward-inspired Generalizable Gaze Estimation in the Wild: This paper proposes OmniGaze, a semi-supervised 3D gaze estimation framework that employs a reward model — fusing visual embeddings, MLLM-generated semantic gaze descriptions, and geometric direction vectors — to assess pseudo-label quality. Trained on 1.4 million unlabeled face images, OmniGaze achieves state-of-the-art performance under both within-domain and cross-domain settings across 5 datasets, and demonstrates zero-shot generalization on 4 unseen datasets.
PandaPose: 3D Human Pose Lifting from a Single Image via Propagating 2D Pose Prior to 3D Anchor Space: This paper proposes PandaPose, which propagates 2D pose priors into a 3D anchor space as a unified intermediate representation. By combining joint-wise adaptive 3D anchor setting with joint-wise depth distribution estimation, PandaPose achieves robust single-frame 3D human pose lifting against occlusion and 2D pose errors.
Part-Aware Bottom-Up Group Reasoning for Fine-Grained Social Interaction Detection: This paper proposes a part-aware bottom-up group reasoning framework that enhances individual embeddings with pose-guided body part features and infers social groups via similarity-based association, achieving new state-of-the-art results on the NVI and Café datasets.
RAPTR: Radar-Based 3D Pose Estimation Using Transformer: This paper proposes RAPTR, the first Transformer framework for radar-based 3D human pose estimation using weak supervision (3D bounding boxes + 2D keypoint labels). Through pseudo-3D deformable attention and structured loss functions, RAPTR substantially outperforms baselines on two indoor datasets.
Some Optimizers are More Equal: Understanding the Role of Optimizers in Group Fairness: This paper presents the first systematic study on how the choice of optimization algorithm affects group fairness in deep learning. Through stochastic differential equation (SDE) analysis and two novel theorems, it demonstrates that adaptive optimizers (RMSProp/Adam) are more likely to converge to fair minima than SGD, particularly under severe data imbalance.
Switchable Token-Specific Codebook Quantization for Face Image Compression: This paper proposes a Switchable Token-Specific Codebook Quantization (STSCQ) mechanism that employs a hierarchical dynamic structure combining image-level codebook routing and token-level codebook partitioning, achieving significant improvements in reconstruction quality and recognition accuracy for face image compression at ultra-low bitrates.
UnCLe: Towards Scalable Dynamic Causal Discovery in Non-Linear Temporal Systems: This paper proposes UnCLe, a scalable dynamic causal discovery method based on TCN autoencoder disentanglement and autoregressive dependency matrices. It infers time-varying causal relationships by measuring per-timestep prediction error increments following temporal perturbation, achieving state-of-the-art performance on both static and dynamic causal discovery benchmarks.
VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image: This paper presents VASA-3D, which adapts VASA-1's 2D motion latent space to a 3D Gaussian splatting representation and leverages VASA-1-synthesized training data for single-image customization, enabling real-time generation (512×512, 75 fps) of lifelike audio-driven 3D head avatars from a single portrait image.
VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models: This paper proposes VimoRAG, a framework that leverages large-scale in-the-wild video databases as 2D motion priors to enhance 3D motion generation. Two core bottlenecks—human motion video retrieval and error propagation—are addressed via the Gemini-MVR retriever and the McDPO training strategy.
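
The Cycle-Sync entry above mentions a Welsch-type robust loss. As a rough illustration of why such a loss is robust, the sketch below implements the textbook Welsch function $\rho_c(r) = \tfrac{c^2}{2}\,(1 - e^{-(r/c)^2})$ and the corresponding reweighting factor $w_c(r) = e^{-(r/c)^2}$. The scale `c` and the exact form used in the paper are assumptions; this is not the paper's implementation.

```python
import math


def welsch_loss(r: float, c: float = 1.0) -> float:
    """Textbook Welsch loss: quadratic near zero, saturating at c**2 / 2."""
    return 0.5 * c * c * (1.0 - math.exp(-((r / c) ** 2)))


def welsch_weight(r: float, c: float = 1.0) -> float:
    """Reweighting factor for iteratively reweighted least squares.

    Equals 1 for a zero residual and decays toward 0 for large residuals,
    so gross outliers contribute almost nothing to the next solve.
    """
    return math.exp(-((r / c) ** 2))


# Small residuals keep nearly full weight; large ones are suppressed.
for residual in (0.0, 0.5, 2.0, 10.0):
    print(residual, welsch_loss(residual), welsch_weight(residual))
```

Unlike a squared loss, the penalty is bounded, so a single badly wrong relative measurement cannot dominate the objective.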

🎯 Object Detection¶

Ascent Fails to Forget: Starting from the statistical dependence between the forget set and the retain set, this paper theoretically and empirically demonstrates that the widely adopted gradient ascent / Descent-Ascent (DA) family of machine unlearning methods fails systematically in the presence of data correlations. In logistic regression, the DA solution is provably farther from the oracle than the original model, and in non-convex settings DA traps the model in inferior local minima.
Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent: This paper proposes a self-reflective agent framework that automatically detects attribute reliance in visual models (e.g., CLIP recognizing "teacher" via classroom backgrounds, YOLOv8 detecting pedestrians via crosswalks) through an iterative hypothesis generation–testing–verification–reflection loop. On a benchmark of 130 models with injected known attribute dependencies, self-reflection significantly improves detection accuracy.
BurstDeflicker: A Benchmark Dataset for Flicker Removal in Dynamic Scenes: This paper introduces BurstDeflicker, the first large-scale benchmark dataset for multi-frame flicker removal (MFFR), comprising three complementary subsets — Retinex-based synthetic data, real-world static data, and green-screen dynamic data — systematically addressing the core bottleneck of obtaining aligned flickering–clean image pairs in dynamic scenes.
CQ-DINO: Mitigating Gradient Dilution via Category Queries for Vast Vocabulary Object Detection: To address positive gradient dilution and hard-negative gradient dilution in large-vocabulary (>10K category) object detection, this paper proposes CQ-DINO: replacing the classification head with learnable category queries and using image-guided Top-K category selection to reduce the negative space by 100×. CQ-DINO surpasses the previous SOTA by 2.1% AP on V3Det (13,204 categories) while remaining competitive on COCO.
DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding: DetectiumFire constructs the largest multi-modal fire understanding dataset — 14.5K real images + 2.5K videos + 8K synthetic images + 12K RLHF preference pairs — with a low duplication rate (0.03 PHash vs. D-Fire 0.15), a 4-level severity classification scheme, and detailed scene descriptions. Fine-tuning YOLOv11m achieves mAP 43.74, and fine-tuning LLaMA-3.2-11B yields 83.84% accuracy on fire severity classification.
DETree: DEtecting Human-AI Collaborative Texts via Tree-Structured Hierarchical Representation Learning: This paper proposes DETree, a framework that constructs a Hierarchical Affinity Tree (HAT) to model the hierarchical relationships among diverse human-AI collaborative text generation processes, and designs a Tree-Structured Contrastive Loss (TSCL) to align the representation space. DETree achieves significant advantages in mixed-text detection and OOD generalization scenarios.
DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection: DitHub reformulates the incremental adaptation problem in open-vocabulary object detection as a "version control" problem: it trains independent LoRA expert modules per category and manages an ever-growing module library via three primitives (branch, fetch, and merge). On ODinW-13 with full data, the method achieves 62.19 mAP, surpassing ZiRa by 4.21 points, while maintaining a zero-shot COCO score of 47.01.
FlexEvent: Towards Flexible Event-Frame Object Detection at Varying Operational Frequencies: This paper proposes FlexEvent, a framework that achieves flexible object detection with event cameras across varying operational frequencies through an adaptive event-frame fusion module (FlexFuse) and a frequency-adaptive fine-tuning mechanism (FlexTune). The framework maintains robust performance in the range of 20Hz to 180Hz, significantly outperforming existing methods.
Generalizable Insights for Graph Transformers in Theory and Practice: This paper proposes the Generalized-Distance Transformer (GDT), a graph Transformer architecture based on standard attention (requiring no modifications to the attention mechanism). It theoretically proves that GDT's expressiveness is equivalent to the GD-WL algorithm, and through large-scale experiments covering 8 million graphs and 270 million tokens, establishes for the first time a fine-grained empirical hierarchy of positional encoding (PE) expressiveness. Under a few-shot transfer setting, GDT surpasses state-of-the-art methods without any fine-tuning.
InstanceAssemble: Layout-Aware Image Generation via Instance Assembling Attention: This paper proposes InstanceAssemble, which injects an "instance assembling attention" mechanism into the Transformer blocks of DiT-based T2I models (SD3 and Flux). By performing independent cross-attention between image tokens within each bounding box region and their corresponding layout hidden states, the method achieves precise instance-level spatial control. A lightweight LoRA adaptation strategy maintains compatibility with existing style LoRAs. The paper also introduces the DenseLayout benchmark (5K images / 90K instances) and a multi-dimensional Layout Grounding Score (LGS) evaluation metric.
Delving into Cascaded Instability: A Lipschitz Continuity View on Image Restoration and Object Detection Synergy: This paper analyzes the root cause of instability in cascaded image restoration and object detection frameworks from a Lipschitz continuity perspective. It identifies an order-of-magnitude smoothness gap between the two networks and proposes LR-YOLO, which integrates the restoration task into the detection backbone's feature learning to regularize the detector's Lipschitz constant, consistently improving detection stability on dehazing and low-light enhancement benchmarks.
MSTAR: Box-Free Multi-Query Scene Text Retrieval with Attention Recycling: This paper presents MSTAR, the first multi-query scene text retrieval method that requires no bounding box annotations. Through Progressive Vision Embedding (PVE), MSTAR progressively shifts attention from salient to non-salient regions. Combined with style-aware instructions and a Multi-Instance Matching (MIM) module, it achieves unified retrieval across four query types—word, phrase, combined, and semantic—and introduces MQTR, the first multi-query text retrieval benchmark.
OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps: OverLayBench introduces the first Layout-to-Image benchmark focused on dense overlap scenarios (4,052 samples + OverLayScore difficulty metric), revealing that SOTA methods suffer severe degradation in mIoU from 71% to 54% under complex overlaps, and proposes Amodal Mask supervision that achieves a 15.9% improvement in overlap IoU.
ReCon-GS: Continuum-Preserved Gaussian Streaming for Fast and Compact Reconstruction: This paper proposes ReCon-GS, which achieves incremental 3D reconstruction via continuum-preserved Gaussian streaming, substantially reducing storage requirements and training time while maintaining rendering quality, and supporting real-time reconstruction of large-scale scenes.
ReCon: Region-Controllable Data Augmentation with Rectification and Alignment for Object Detection: ReCon proposes a training-free, region-controllable data augmentation framework that enhances the detection data quality of existing structure-controllable generative models through Region-Guided Rectification (RGR) and Region-Aligned Cross-Attention (RACA), achieving 35.5 mAP on COCO—surpassing GeoDiffusion, which requires fine-tuning.
Test-Time Adaptive Object Detection with Foundation Model: This paper proposes TTAOD, a source-free open-vocabulary test-time adaptive object detection framework that combines multimodal Prompt Tuning, Mean-Teacher, an Instance Dynamic Memory (IDM) module, and memory augmentation/hallucination strategies. It achieves 56.2% AP50 on Pascal-C (+11.0 vs. SOTA) and demonstrates consistent gains across 13 cross-domain datasets.
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension: This paper proposes Video-RAG, a training-free, plug-and-play RAG pipeline that extracts visually-aligned auxiliary texts (OCR, ASR, object detection) from video, retrieves relevant content, and feeds it into LVLMs. With an overhead of only ~2K tokens, it improves average Video-MME performance by 2.8% across seven open-source LVLMs, and the 72B model surpasses GPT-4o.

👥 Social Computing

Active Slice Discovery in Large Language Models: This paper proposes the Active Slice Discovery problem framework, integrating active learning into LLM error slice discovery. By combining uncertainty sampling with LLM internal representations (raw embeddings or SAE features), the method achieves slice detection accuracy comparable to fully supervised settings using only 2–10% of labeled data.
Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning-based Bias Detector: This paper proposes the Reasoning-based Bias Detector (RBD), a plug-and-play debiasing module for LLM judges. By externally detecting four types of evaluation bias (verbosity, position, bandwagon, and sentiment), RBD generates structured feedback with reasoning chains to guide judges toward self-correction. RBD-8B achieves an average accuracy improvement of 18.5% and consistency improvement of 10.9% across 8 LLM judges.
Auto-Search and Refinement: An Automated Framework for Gender Bias Mitigation in LLMs: This paper proposes FaIRMaker, a framework that adopts an "auto-search + refinement" paradigm: it first employs gradient-based optimization to identify debiasing trigger tokens (Fairwords), then trains a seq2seq model to transform them into human-readable instructions, effectively mitigating gender bias on both open-source and closed-source LLMs while preserving or even improving task performance.
AVerImaTeC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from the Web: AVerImaTeC introduces the first image-text fact-checking dataset with complete evidence annotation — 1,297 real-world image-text claims, a 5-stage annotation pipeline (extraction → QA reasoning → sufficiency check → iterative refinement → second check), and temporally constrained evidence (to prevent temporal leakage). The baseline system achieves 82% accuracy with ground-truth evidence, but drops to 15–25% under automatic evidence retrieval, revealing the substantial challenges of image-text verification.
Concept-Level Explainability for Auditing & Steering LLM Responses: This paper proposes ConceptX, an LLM explainability method based on concept-level (rather than token-level) Shapley attribution. It measures the influence of input concepts on outputs via semantic similarity rather than token overlap, and can be used to audit bias and steer LLM outputs through prompt editing — reducing attack success rate from 0.463 to 0.242 in jailbreak defense.
DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models: DATE-LM introduces the first unified benchmark for evaluating data attribution methods in LLMs. Through three application-driven tasks—training data selection, toxicity filtering, and factual attribution—it systematically compares multiple attribution approaches, finding that no single method dominates across all tasks and that simple baselines can match attribution methods in certain settings.
DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding: Inspired by the depth-first search (DFS) algorithm, DeepTraverse is a visual backbone network that achieves highly competitive image classification performance with very few parameters, through a parameter-sharing recursive exploration module and an adaptive channel recalibration module.
Don't Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation: This paper proposes Token Timestep Allocation (TTA-Diffusion), which assigns independent denoising timesteps to each token to address the update-forgetting problem caused by classifier guidance in diffusion language models, achieving substantial improvements in both stability and efficiency for controllable text generation.
Evaluating Multiple Models Using Labeled and Unlabeled Data: This paper proposes SSME (Semi-Supervised Model Evaluation), which leverages a small amount of labeled data and a large amount of unlabeled data to estimate the joint distribution $P(y, \mathbf{s})$ of multiple classifiers via a semi-supervised mixture model, enabling accurate classifier performance evaluation with errors reduced to 1/5 of those incurred when using labeled data alone.
GraphKeeper: Graph Domain-Incremental Learning via Knowledge Disentanglement and Preservation: GraphKeeper addresses catastrophic forgetting in Graph Domain-Incremental Learning (Graph Domain-IL) through three components: domain-specific LoRA parameter isolation, intra/inter-domain disentanglement, and ridge regression-based deviation-free knowledge preservation. It outperforms the second-best method by 6.5%–16.6% and can be seamlessly integrated with graph foundation models.
IF-GUIDE: Influence Function-Guided Detoxification of LLMs: This paper proposes IF-Guide, which leverages influence functions to identify toxic content in training data at token granularity and, via a penalty-based training objective, actively discourages the model from learning toxic behaviors during pre-training or fine-tuning, substantially outperforming passive alignment methods such as DPO and RAD.
Noise-Robustness Through Noise: A Framework Combining Asymmetric LoRA with Poisoning MoE: This paper proposes LoPE, which designates a dedicated "poisoning expert" within an asymmetric LoRA architecture to absorb injected noise during training; at inference time, this expert is masked so that only the clean experts contribute to the output — achieving noise robustness through noise itself, entirely without data cleaning.
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents: This paper presents OS-Harm, the first safety benchmark targeting general-purpose computer use agents (beyond browser-only settings), covering 150 tasks across three risk categories: deliberate user misuse, prompt injection attacks, and model misbehavior. Evaluations reveal that frontier models (o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro, etc.) broadly comply with harmful instructions (up to 70% unsafe rate) and comply with basic prompt injection attacks at a 20% rate.
Policy-as-Prompt: Turning AI Governance Rules into Guardrails for AI Agents: This paper proposes the Policy-as-Prompt framework, a two-stage end-to-end pipeline—POLICY-TREE-GEN and POLICY-AS-PROMPT-GEN—that automatically converts a team's existing unstructured design documents (PRD, TDD, code) into runtime-enforceable policy guardrails, using a lightweight LLM as a compliance "judge," achieving 70–73% input/output classification accuracy in HR and SOC applications.
Position Paper: If Innovation in AI Systematically Violates Fundamental Rights, Is It Innovation at All?: This paper challenges the prevailing belief that regulation and innovation are inherently at odds. Through historical analogies from pharmaceuticals, aviation, and welfare systems, combined with an analysis of the Collingridge dilemma, it argues that well-designed regulation serves as the foundation for sustainable innovation rather than an impediment to it. The regulatory sandbox, SME support mechanisms, and other provisions of the EU AI Act are presented as exemplars demonstrating how regulation can accelerate, rather than delay, responsible technological progress.
Precise Information Control in Long-Form Text Generation: This paper proposes the Precise Information Control (PIC) task, which requires LLMs to generate long-form text that strictly adheres to a given set of claims (neither omitting nor adding information). The authors construct PIC-Bench to evaluate 8 tasks, finding that over 70% of outputs from state-of-the-art models contain faithfulness hallucinations. Through weakly supervised preference data construction combined with DPO training, the proposed PIC-LM improves the F1 of an 8B model from 69.1% to 91.0%.
SLAyiNG: Towards Queer Language Processing: This work introduces SLAyiNG, the first explicitly annotated queer slang dataset, comprising 695 terms and nearly 200,000 usage instances. Inter-annotator agreement experiments (Krippendorff's $\alpha = 0.746$) demonstrate that reasoning models can serve as pre-screening tools but community-driven expert annotation remains indispensable.
VDRP: Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection: This paper proposes the VDRP framework, which addresses two core challenges in zero-shot HOI detection — intra-class visual diversity and inter-class visual entanglement — through visual diversity-aware prompt learning (via group-level variance injection and Gaussian perturbation) and region-aware prompt augmentation (via LLM-generated regional concept retrieval).

🌐 Multilingual & Translation

Adaptive Originality Filtering: Rejection-Based Prompting and RiddleScore for Culturally Grounded Multilingual Riddle Generation: This paper proposes Adaptive Originality Filtering (AOF)—a semantic rejection-sampling prompting strategy that filters repetitive or templated outputs via MiniLM embedding cosine similarity, compelling LLMs to generate more novel, diverse, and culturally grounded multilingual riddles. It also introduces the RiddleScore composite evaluation metric (Novelty + Diversity + Fluency + Alignment), achieving a human correlation of $\rho=0.83$.
DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection: This work constructs DCAD-2000, a multilingual dataset covering 2,282 languages and 46.72 TB of text, and proposes a language-agnostic data cleaning framework that reformulates cleaning as anomaly detection. The framework extracts 8-dimensional statistical features per document and applies Isolation Forest for dynamic noise filtering. Effectiveness is validated on multiple multilingual benchmarks, with particularly notable gains on low-resource languages.
Enhancing Multilingual LLM Pretraining with Model-Based Data Selection: This paper proposes a transparent, simple, and efficient model-based data selection framework for multilingual pretraining. It leverages FastText and Transformer (XLM-RoBERTa) embedding classifiers to identify structured and knowledge-rich samples. On the FineWeb-2 dataset, the framework matches baseline MMLU scores using only 15% of tokens, and is extended to 20 languages with publicly released curated pretraining datasets.
Exploring the Translation Mechanism of Large Language Models: This paper proposes a subspace-intervened path patching method for fine-grained causal analysis of the translation mechanism in LLMs. The study finds that translation is driven by a sparse set of attention heads comprising fewer than 5% of all heads, categorized into three functional roles: source heads, indicator heads, and positional heads. MLP layers integrate these features into an English-centric intermediate representation, and fine-tuning only 64 critical heads achieves performance comparable to full-parameter fine-tuning.
HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages: NVIDIA releases a 40K+ open-source human-annotated preference dataset covering general, STEM, code, and multilingual (13 languages) tasks. The reward model trained on this dataset achieves 82.4% (+10%) on RM-Bench, with a commercially friendly CC-BY-4.0 license.
How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs: Under a high-dimensional asymptotic framework, this paper proves that Transformers with nonlinear MLP heads are asymptotically equivalent to structured polynomial predictors in terms of ICL error, revealing the gain mechanism of nonlinear MLPs on nonlinear tasks and establishing that low noise and structured covariance are key characteristics of high-quality data sources in multi-source data mixing.
MergeBench: A Benchmark for Merging Domain-Specialized LLMs: MergeBench is the first comprehensive benchmark suite for evaluating large-scale domain-specialized LLM merging, covering Llama and Gemma families up to 9B parameters, five task domains, and eight merging methods, providing systematic evaluation and practical guidelines across three dimensions: multi-task performance, forgetting, and runtime efficiency.
MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query: This paper introduces MERIT, the first multilingual interleaved multi-condition semantic retrieval dataset (320K queries, 135K products, 5 languages, 7 product categories), exposes the bottleneck of existing retrieval models that focus solely on global semantics while neglecting condition-level details, and proposes the Coral fine-tuning framework that combines embedding reconstruction with contrastive learning to achieve a 45.9% improvement in retrieval performance.
ParallelPrompt: Extracting Parallelism from Large Language Model Queries: This work presents ParallelPrompt, the first benchmark for intra-query parallelism, comprising structured decomposition annotations for 37,000+ real user prompts. It demonstrates that approximately 10% of user queries contain exploitable parallel structure, and that parallel execution can achieve up to 5.7× latency speedup with limited quality degradation.
Quantifying Climate Policy Action and Its Links to Development Outcomes: A Cross-National Data-Driven Analysis: This paper constructs an integrated NLP–econometrics framework that first uses a fine-tuned multilingual DistilBERT to automatically classify global climate policy documents by topic (Mitigation / Adaptation / Disaster Risk Management / Loss & Damage) with F1 = 0.90, then conducts fixed-effects panel regression against World Bank development indicators, finding that mitigation policies are significantly positively associated with GDP/GNI, while Loss & Damage policies remain substantially unimplemented worldwide.
Reflective Translation: Improving Low-Resource Machine Translation via Structured Self-Reflection: This paper proposes the Reflective Translation framework, which enables LLMs to perform structured self-critique of their initial translations at inference time—identifying mistranslations, omissions, and semantic distortions—and subsequently generate revised translations based on this critique. The approach requires no fine-tuning or additional annotated data, yet achieves statistically significant improvements in BLEU and COMET on low-resource African languages such as isiZulu and isiXhosa.
XIFBench: Evaluating Large Language Models on Multilingual Instruction Following: This paper proposes XIFBench — the first constraint-driven benchmark systematically evaluating LLMs' multilingual instruction-following capabilities. It comprises 558 instructions (0–5 constraints, 5 categories × 21 dimensions) across 6 languages (high/mid/low resource), and introduces an English-requirement anchoring evaluation protocol that achieves 94.7% cross-lingual evaluation consistency.
Zero-Shot Performance Prediction for Probabilistic Scaling Laws: This paper frames NLP learning curve prediction as a multi-task learning problem, employing a latent-variable multi-output Gaussian process (MaGP) to capture the bi-level hierarchical structure of datasets and inter-task correlations, enabling zero-shot prediction of learning curves and deriving probabilistic scaling laws via Monte Carlo simulation.
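
The similarity-based rejection step described in the Adaptive Originality Filtering entry above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_embed` is a hypothetical stand-in for a MiniLM sentence encoder, `filter_original` is a name introduced here, the `threshold` value is illustrative, and the re-prompting loop that a full rejection-sampling strategy would need is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_original(candidates, embed, threshold=0.9):
    """Keep a candidate only if it is dissimilar to every accepted one.

    `embed` is a hypothetical callable mapping text to a vector;
    `threshold` is an illustrative value, not one taken from the paper.
    """
    accepted, accepted_vecs = [], []
    for text in candidates:
        vec = embed(text)
        if all(cosine(vec, seen) < threshold for seen in accepted_vecs):
            accepted.append(text)
            accepted_vecs.append(vec)
    return accepted

def toy_embed(text):
    """Toy stand-in for a sentence encoder: letter-frequency counts."""
    return [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

riddles = [
    "What has keys but cannot open locks?",
    "What has keys but can not open locks?",  # near-duplicate of the first
    "I fly without wings and cry without eyes.",
]
kept = filter_original(riddles, toy_embed)
```

With a real encoder, only `toy_embed` would change; the filter accepts candidates greedily in order, so earlier outputs define what counts as repetitive.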

📡 Signal & Communications

Angular Steering: Behavior Control via Rotation in Activation Space: This paper proposes Angular Steering, which unifies LLM activation steering as rotation operations within a fixed 2D subspace — providing a continuous, fine-grained, norm-preserving behavior control knob spanning 0°–360° via rotation angle. The framework subsumes activation addition and directional ablation as special cases of rotation, and demonstrates robust behavior control on Llama 3 / Qwen 2.5 / Gemma 2 (3B–14B).
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond): This work introduces the Infinity-Chat dataset (26K open-ended real-world user queries with 31,250 human annotations) to expose the "Artificial Hivemind" phenomenon in language models — severe intra-model repetition and inter-model homogeneity in open-ended generation — and demonstrates that Reward Models and LM Judges fail to calibrate on samples with high inter-annotator preference divergence.
Bispectral OT: Dataset Comparison using Symmetry-Aware Optimal Transport: This paper proposes Bispectral Optimal Transport (BOT), which replaces the raw pixel distances in the cost matrix of discrete optimal transport with bispectrum (group Fourier invariant) distances. This enables the transport plan to precisely eliminate group-action-induced variation (e.g., rotation) while preserving signal structure, improving class-preservation accuracy from 33% to 84% on rotation-augmented datasets such as MNIST.
ConTextTab: A Semantics-Aware Tabular In-Context Learner: ConTextTab integrates semantic embeddings (text encodings of column names and categorical values) into a table-native ICL architecture, and pretrains on large-scale real-world tabular data (T4, ~2.18M tables). It achieves a new state of the art on the semantics-rich CARTE benchmark while remaining competitive with existing methods on non-semantic benchmarks.
Contrastive Consolidation of Top-Down Modulations Achieves Sparsely Supervised Continual Learning: This paper proposes Task-Modulated Contrastive Learning (TMCL), inspired by top-down modulations in the neocortex. TMCL integrates sparse label information (as few as 1% labels) via affine modulation during continual learning, then consolidates the modulation information into feedforward weights through contrastive learning, surpassing both unsupervised and supervised baselines on class-incremental and transfer learning benchmarks.
Estimation of Stochastic Optimal Transport Maps: This paper introduces a transport error metric $\mathcal{E}_p$ for stochastic OT maps, decomposed into an optimality gap and a feasibility gap. Under minimal assumptions that require neither the existence nor uniqueness of a Brenier map, a computationally efficient rounding estimator is constructed that achieves a near-optimal convergence rate of $\tilde{O}(n^{-1/(d+2p)})$. The framework is further extended to Hölder-continuous kernels and adversarially corrupted data, establishing the first general theory for OT map estimation.
Feature-aware Modulation for Learning from Temporal Tabular Data: This paper argues that the core challenge in temporal tabular learning is not simply "adding a time embedding," but rather that the semantics of many features drift over time. To address this, the paper proposes feature-aware modulation, which uses temporal context to dynamically generate per-feature shift, scale, and nonlinear shape parameters, re-aligning cross-temporal semantics. The approach enables deep models to consistently outperform GBDT on average rank for the first time on the TabReD benchmark.
Masked Symbol Modeling for Demodulation of Oversampled Baseband Communication Signals: This paper proposes Masked Symbol Modeling (MSM), transplanting BERT's masked prediction paradigm to the communication physical layer. It reframes inter-symbol contributions from pulse shaping as "contextual information," training a Transformer on clean oversampled baseband signals to learn waveform structure, and leveraging the learned context at inference time to recover symbols corrupted by impulsive noise.
Memory-Integrated Reconfigurable Adapters: A Unified Framework for Settings with Multiple Tasks: MIRA embeds Hopfield-style associative memory modules into each layer of a ViT, storing and retrieving LoRA adapter weights as key-value pairs. Through a two-stage training procedure (Adaptation + Consolidation), a single unified architecture simultaneously addresses Domain Generalization (DG), Class-Incremental Learning (CIL), and Domain-Incremental Learning (DIL), achieving substantial improvements over task-specific methods across multiple benchmarks.
Multi-Modal Masked Autoencoders for Learning Image-Spectrum Associations for Galaxy Evolution and Cosmology: This paper constructs a multi-modal image–spectrum–redshift dataset (GalaxiesML-Spectra) comprising 134,533 galaxies and adapts it for a Multi-Modal Masked Autoencoder (MMAE) that jointly reconstructs images and spectra while regressing redshift. Experiments demonstrate that, even when spectra are entirely absent at test time, the model achieves a redshift prediction scatter of $\sigma_{NMAD} = 0.016$ from 25%-masked images alone, outperforming AstroCLIP.
Perturbation Bounds for Low-Rank Inverse Approximations under Noise: This work derives the first non-asymptotic spectral norm perturbation bounds for low-rank inverse approximations $\|(\tilde{A}^{-1})_p - A_p^{-1}\|$ under noise, via a novel contour bootstrapping technique that handles the non-entire function $f(z) = 1/z$. Under favorable conditions, the proposed bounds improve upon classical bounds by a factor of $\sqrt{n}$.
The Last Vote: A Multi-Stakeholder Framework for Language Model Governance: This paper proposes a comprehensive framework for language model governance comprising a seven-category democratic risk taxonomy, a stakeholder-adaptive Incident Severity Score (ISS), and a phased six-year implementation roadmap, with the goal of embedding democratic values into the institutional design of AI regulation.
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning: This paper decomposes reinforcement learning with verifiable rewards (RLVR) into positive sample reinforcement (PSR, which increases the probability of correct responses) and negative sample reinforcement (NSR, which penalizes incorrect responses). It finds that NSR alone consistently improves reasoning performance across the full Pass@k spectrum and typically matches or surpasses PPO/GRPO. Based on this finding, the paper proposes Weighted-REINFORCE (reducing the PSR weight to 0.1), achieving state-of-the-art results across MATH, AIME 2025, and AMC23.
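
The PSR/NSR weighting in the entry above can be sketched as a single loss function. This is a hedged illustration only: the summary states just that the PSR weight is reduced to 0.1, so the +1/-1 reward convention, the batch mean, and the function name `weighted_reinforce_loss` are assumptions made here.

```python
def weighted_reinforce_loss(log_probs, correct, psr_weight=0.1):
    """Scalar loss over a batch of sampled responses.

    log_probs: sequence log-probabilities log pi(y|x), one per response.
    correct:   booleans from a verifier (True = correct answer).
    Assumption: correct samples get reward +psr_weight (PSR term) and
    incorrect samples get reward -1 (NSR term).
    """
    total = 0.0
    for lp, ok in zip(log_probs, correct):
        reward = psr_weight if ok else -1.0
        # Minimizing -reward * lp raises lp when reward > 0
        # and lowers it when reward < 0.
        total += -reward * lp
    return total / len(log_probs)

# One correct and two incorrect responses.
loss = weighted_reinforce_loss([-1.0, -2.0, -0.5], [True, False, False])
# Setting psr_weight=0 leaves only the negative-sample term.
nsr_only = weighted_reinforce_loss([-1.0, -2.0, -0.5],
                                   [True, False, False], psr_weight=0.0)
```

In an actual training loop the log-probabilities would be differentiable tensors from the policy; plain floats are used here to keep the arithmetic easy to follow.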

🛰️ Remote Sensing

C3PO: Cross-View Cross-Modality Correspondence by Pointmap Prediction: This paper introduces the C3 dataset comprising 90K ground photo–floor plan pairs (597 scenes, 153M pixel-level correspondences, and 85K camera poses), exposes the limitations of existing correspondence models under cross-view cross-modality settings (e.g., ground photos vs. floor plans), and demonstrates that training on this dataset reduces the RMSE of the best-performing baseline by 34%.
ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning: This paper proposes ChA-MAEViT, which enhances cross-channel feature learning for multi-channel images (MCI) through four key components: dynamic channel-patch joint masking, memory tokens, hybrid token fusion, and a channel-aware decoder. The method outperforms the state of the art by an average of 3.0–21.5% across three satellite and microscopy datasets.
Connecting the Dots: A Machine Learning Ready Dataset for Ionospheric Forecasting Models: As a product of the 2025 NASA Frontier Development Lab (FDL) Heliolab program, this paper presents the first comprehensive ML-ready dataset for ionospheric forecasting. It unifies seven categories of heterogeneous data sources — Solar Dynamics Observatory (SDO) extreme ultraviolet (EUV) irradiance embeddings, solar wind parameters, interplanetary magnetic field (IMF), geomagnetic activity indices, JPL dense TEC global ionospheric maps (GIMs), Madrigal sparse TEC, solar flux indices, and orbital mechanics parameters — into a consistent temporal-spatial structure. Building on this dataset, multiple spatiotemporal forecasting architectures are trained, including LSTM, Spherical Fourier Neural Operator (SFNO), and GraphCast, achieving autoregressive prediction of global vertical total electron content (vTEC) up to 12 hours ahead under both quiet and geomagnetically active conditions, surpassing the persistence baseline.
EcoCast: A Spatio-Temporal Model for Continual Biodiversity and Climate Risk Forecasting: This paper proposes EcoCast, a Transformer-based spatio-temporal sequence model that integrates satellite remote sensing (Sentinel-2), climate reanalysis (ERA5), and citizen science observations (GBIF). The model predicts next-month species occurrence probabilities from 12-month environmental feature sequences. On a five-species African bird distribution prediction task, the macro-average F1 score improves from 0.31 (Random Forest) to 0.65. An EWC-based continual learning framework is also designed to accommodate data updates.
GeoLink: Empowering Remote Sensing Foundation Model with OpenStreetMap Data: GeoLink directly integrates OpenStreetMap vector data into remote sensing foundation model pretraining by encoding OSM data with a heterogeneous GNN and designing multi-granularity cross-modal learning objectives (region–image-level contrastive + object–patch-level fusion). Pretrained efficiently on 1.27 million sample pairs, GeoLink surpasses existing RS FMs across 7 classification and 4 segmentation/change detection benchmarks.
GreenHyperSpectra: A Multi-Source Hyperspectral Dataset for Global Vegetation Trait Prediction: GreenHyperSpectra constructs a pretraining dataset of 140,000+ multi-source hyperspectral vegetation samples spanning proximal, airborne, and satellite platforms. Label-efficient regression models trained via semi-supervised and self-supervised methods (MAE, GAN, RTM-AE) comprehensively outperform fully supervised baselines on 7 plant trait prediction tasks, with particularly pronounced advantages under label-scarce and out-of-distribution scenarios.
Mass Conservation on Rails – Rethinking Physics-Informed Learning of Ice Flow Vector Fields: This paper proposes a divergence-free neural network (dfNN) that architecturally enforces exact mass conservation (divergence identically zero) via the symplectic gradient of a stream function, and combines it with a directional guidance learning strategy. The approach significantly outperforms soft-constraint PINNs and unconstrained NNs on ice flux interpolation over Antarctica's Byrd Glacier.
OrbitZoo: Real Orbital Systems Challenges for Reinforcement Learning: This paper presents OrbitZoo, a multi-agent RL environment built on the industrial-grade Orekit orbital mechanics library, supporting realistic orbital tasks such as collision avoidance, Hohmann transfers, and constellation coordination. It provides standardized MARL training through the PettingZoo interface, and achieves 24-meter RMSE (over a 16.6-hour propagation) for the low-error group in validation against real Starlink ephemeris data.
OrthoLoC: UAV 6-DoF Localization and Calibration Using Orthographic Geodata: OrthoLoC introduces the first large-scale UAV 6-DoF localization benchmark built upon orthographic geodata (DOP+DSM), comprising 16,425 real UAV images across 47 regions in Germany and the United States. It further proposes AdHoP (Adaptive Homography Preprocessing), a plug-and-play matching enhancement that improves matching performance by 95% and reduces translation error by 63% without modifying the underlying feature matcher.
RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events: This work introduces RSCC — the first large-scale disaster-aware remote sensing change captioning dataset comprising 62,351 pre/post-disaster image pairs with detailed change descriptions, covering 31 global disaster events including earthquakes, floods, and wildfires. High-quality annotations are generated using the QvQ-Max visual reasoning model, and a comprehensive benchmark evaluation framework is established.
Scaling Image Geo-Localization to Continent Level: A hybrid approach combining classification-learned prototypes with aerial image embeddings achieves 68%+ recall@1 within 200 m and 59.2% within 100 m across 433,000 km² of Western Europe — the first system to attain such precision at continental scale.
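For the Mass Conservation on Rails entry above, the idea of obtaining a divergence-free field from the symplectic gradient of a stream function can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the hand-written `stream_function` stands in for a learned network, and central finite differences stand in for automatic differentiation. It is not the paper's implementation.

```python
import math

def stream_function(x, y):
    # Hypothetical smooth scalar potential; a dfNN would learn this with a network.
    return math.sin(1.3 * x) * math.cos(0.7 * y) + 0.1 * x * y

def velocity(x, y, h=1e-3):
    # Symplectic gradient of psi: (u, v) = (d psi / dy, -d psi / dx),
    # approximated with central differences.
    u = (stream_function(x, y + h) - stream_function(x, y - h)) / (2 * h)
    v = -(stream_function(x + h, y) - stream_function(x - h, y)) / (2 * h)
    return u, v

def divergence(x, y, h=1e-3):
    # du/dx + dv/dy with the same central-difference stencil.
    du_dx = (velocity(x + h, y, h)[0] - velocity(x - h, y, h)[0]) / (2 * h)
    dv_dy = (velocity(x, y + h, h)[1] - velocity(x, y - h, h)[1]) / (2 * h)
    return du_dx + dv_dy

if __name__ == "__main__":
    for point in [(0.5, 0.2), (-1.0, 2.0), (3.0, -0.4)]:
        print(point, velocity(*point), divergence(*point))
```

Because the mixed partial derivatives of the stream function cancel, the divergence vanishes by construction regardless of which potential is used, which is the sense in which the constraint is architectural rather than a penalty term.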

🔎 AIGC Detection

ASCIIBench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text: This paper introduces ASCIIBench, the first publicly available benchmark for ASCII art understanding and generation (5,315 images, 752 categories). Systematic evaluation reveals that the visual modality substantially outperforms the text modality, multimodal fusion yields no benefit, and CLIP exhibits a fundamental bottleneck in representing ASCII structure—only categories with high intra-class consistency can be effectively distinguished.
Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content: This paper proposes a dual-agent (quantitative + qualitative) evaluation framework that systematically assesses the faithfulness of GPT-4o, Ansari AI, and Fanar on Islamic content generation tasks across three dimensions—theological accuracy, citation integrity, and stylistic appropriateness—finding that even the best-performing model exhibits significant deficiencies in citation reliability.
Classical Planning with LLM-Generated Heuristics: Challenging the State of the Art with Python Code: This paper proposes having LLMs generate Python code for domain-dependent heuristic functions rather than generating plans directly. The LLM is sampled $n$ times to obtain candidate heuristics, the best candidate is selected on a training set, and it is then injected into the Python planner Pyperplan for use with greedy best-first search (GBFS). Despite running in pure Python, the approach surpasses all traditional heuristics of the C++ planner Fast Downward on 8 IPC 2023 benchmark domains, matches the SOTA learned planner $h^{\mathrm{WLF}}_{\mathrm{GPR}}$, and guarantees 100% correctness for all plans found.
CLAWS: Creativity Detection for LLM-Generated Solutions Using Attention Window of Sections: This paper proposes CLAWS, a method that analyzes the attention weight distribution of LLMs across different prompt sections during mathematical solution generation to classify outputs as "creative," "typical," or "hallucinated," without requiring human evaluation.
DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code: DuoLens is an AI-generated content detection framework based on dual-encoder fusion of CodeBERT and CodeBERTa. It achieves AUROC of 0.97–0.99 on multilingual text (8 languages) and source code (7 programming languages) at significantly reduced computational cost (8–12× lower latency, 3–5× lower VRAM), substantially outperforming large models such as GPT-4o.
"Jutters": Through the metaphor of the Dutch tradition of jutters (beachcombers), this work constructs an immersive installation art piece that integrates real beach debris with AI-generated images and videos, guiding visitors to adopt a beachcomber's mindset in reflecting on how to engage with AI-generated content.
Reasoning Compiler: LLM-Guided Optimizations for Efficient Model Serving: This paper proposes Reasoning Compiler, which models compiler optimization as a sequential decision-making process, employing an LLM as a context-aware proposal engine combined with MCTS to balance exploration and exploitation. The approach achieves an average 5.0× speedup across 5 representative benchmarks and 5 hardware platforms, with 10.8× better sampling efficiency than TVM's evolutionary search.
Synthesizing Performance Constraints for Evaluating and Improving Code Efficiency: This paper proposes Wedge, a framework that uses LLMs to synthesize performance-characterizing constraints to guide constraint-aware fuzzing, generating stress-test inputs that expose code performance bottlenecks. It further constructs the PerfForge benchmark, enabling LLM-based code optimizers (e.g., Effi-Learner) to achieve up to 24% additional reduction in CPU instructions.
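For the Classical Planning with LLM-Generated Heuristics entry above, the pipeline of "sample candidate heuristics, select the best on training tasks, then run GBFS" can be sketched on a toy domain. The integer-walk domain, the two hand-written candidate heuristics, and the expansion-count selection criterion are all assumptions for illustration; the paper uses LLM-written heuristics inside Pyperplan, and nothing below reflects that planner's API.

```python
import heapq

MOVES = (1, 3, -1)  # toy domain: walk along the integers from 0 to a goal

def gbfs(start, goal, heuristic, max_expansions=10_000):
    """Greedy best-first search: always expand the open state with the lowest h."""
    frontier = [(heuristic(start), 0, start, [start])]
    seen = {start}
    counter = 1  # tie-breaker so the heap never compares paths
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, state, path = heapq.heappop(frontier)
        if state == goal:
            return path, expansions
        expansions += 1
        for move in MOVES:
            nxt = state + move
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (heuristic(nxt), counter, nxt, path + [nxt]))
                counter += 1
    return None, expansions

# Hypothetical stand-ins for LLM-generated candidates: each builds h for a given goal.
CANDIDATES = {
    "blind": lambda goal: (lambda s: 0),
    "goal_distance": lambda goal: (lambda s: abs(goal - s)),
}

def select_best(candidates, training_goals):
    # Score each candidate by total expansions on the training tasks (lower is better).
    totals = {}
    for name, make_h in candidates.items():
        total = 0
        for goal in training_goals:
            path, expansions = gbfs(0, goal, make_h(goal))
            total += expansions if path is not None else 10**9
        totals[name] = total
    return min(totals, key=totals.get)

best = select_best(CANDIDATES, [7, 12, 20])
plan, _ = gbfs(0, 25, CANDIDATES[best](25))
```

Since the search only ever follows legal transitions, any plan it returns is valid no matter how poor the heuristic is; a bad heuristic costs search effort, not correctness.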

✏️ Knowledge Editing

Edit Less, Achieve More: Dynamic Sparse Neuron Masking for Lifelong Knowledge Editing in LLMs: This paper proposes NMKE, a framework that identifies two categories of knowledge neurons—knowledge-general and knowledge-specific—via neuron-level attribution, and applies entropy-guided dynamic sparse masking to achieve precise neuron-level knowledge editing. NMKE maintains high edit success rates and general model capabilities after 5,000 consecutive edits.
KScope: A Framework for Characterizing the Knowledge Status of Language Models: This paper proposes a five-category taxonomy of LLM knowledge status (Consistent Correct / Conflicting Correct / Missing / Conflicting Incorrect / Consistent Incorrect) and the KScope hierarchical statistical testing framework. By combining repeated sampling with multi-step hypothesis testing, KScope precisely characterizes the modal structure of an LLM's knowledge for a given question, and systematically investigates how context updates each knowledge state. The study finds that constrained context summarization combined with credibility augmentation improves knowledge update success rates by an average of 4.3%.
MemEIC: A Step Toward Continual and Compositional Knowledge Editing: This paper proposes MemEIC, a three-tier framework for continual and compositional knowledge editing in large vision-language models (LVLMs), combining an external dual-modal retrieval memory (Mem-E), an internal modality-decoupled LoRA adapter (Mem-I), and a brain-inspired Knowledge Connector. MemEIC substantially outperforms existing methods on the newly introduced CCKEB benchmark.
MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs: MEMOIR introduces a framework that incorporates zero-initialized residual memory matrices into FFN layers, employs TopHash-based sparse masks to confine each edit to a distinct subset of memory parameters, and at inference time conditionally activates stored knowledge by measuring mask overlap. The approach achieves an optimal balance among reliability, generalization, and locality across 15,000 sequential edits.
Rethinking Residual Distribution in Locate-then-Edit Model Editing: This paper reveals that the residual distribution mechanism in locate-then-edit model editing introduces weight deviation errors that grow with distribution distance, batch size, and sequential edit length. It proposes BLUE (Boundary Layer UpdatE), a strategy that updates only the first and last critical layers, achieving an average improvement of 35.59%.
UniEdit: A Unified Knowledge Editing Benchmark for Large Language Models: This paper presents UniEdit — the first unified LLM knowledge editing benchmark built upon an open-domain knowledge graph (Wikidata), covering 311K samples across 25 domains in 5 major categories. By introducing the Neighborhood Multi-hop Chain Sampling (NMCS) algorithm, UniEdit integrates diverse generalization and locality evaluation criteria into a single framework, systematically revealing the shortcomings of existing editing methods under complex ripple effect evaluations.

🗣️ Dialogue Systems

AC-LoRA: (Almost) Training-Free Access Control-Aware Multi-Modal LLMs: AC-LoRA is an end-to-end system that trains independent LoRA adapters for datasets with different permission levels. At inference time, it dynamically retrieves the relevant adapters based on cosine similarity and user permissions and merges their outputs without any additional training, achieving strong information isolation while matching or surpassing SOTA LoRA mixture methods in response quality.
Bridging Human and LLM Judgments: Understanding and Narrowing the Gap: This paper proposes Bridge, a statistical framework that models the latent relationship between human and LLM judgments via ordinal logistic regression. With a small number of human labels, Bridge improves the calibration and alignment of LLM judgments while supporting formal statistical hypothesis testing for systematic biases.
HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location: This paper proposes HyGen, an interference-aware LLM inference system that achieves elastic co-location of online and offline workloads through an accurate batch latency predictor, an SLO-aware performance profiler, and a prefix-sharing-maximization scheduling strategy, delivering 3.87–5.84× throughput gains while strictly guaranteeing SLO compliance.
MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems: This paper proposes MetaMind — a multi-agent framework inspired by psychological metacognition theory — that significantly enhances the social reasoning capabilities of LLMs through three-stage collaboration: a ToM Agent (mental state hypothesis generation), a Moral Agent (social norm-constrained refinement), and a Response Agent (response generation with self-verification). MetaMind achieves state-of-the-art performance on multiple social intelligence benchmarks, approaching human-level performance for the first time.
SciArena: An Open Evaluation Platform for Non-Verifiable Scientific Literature-Grounded Tasks: SciArena is a community-driven open evaluation platform for scientific literature tasks. It adopts a Chatbot Arena-style human preference voting paradigm to rank 47 foundation models, collecting over 20,000 votes, and releases SciArena-Eval as a meta-benchmark for assessing the ability of automated evaluation systems to judge answer quality on literature-grounded tasks.

🌍 Earth Science

A Probabilistic U-Net Approach to Downscaling Climate Simulations: This work presents the first application of a probabilistic U-Net to statistical climate downscaling (16× super-resolution). By sampling from a variational latent space, the model generates ensemble forecasts for uncertainty quantification. The paper systematically compares four training objectives — WMSE, MS-SSIM, WMSE-MS-SSIM, and afCRPS — revealing complementary trade-offs between extreme event capture and fine-scale spatial variability preservation.
Adaptive Online Emulation for Accelerating Complex Physical Simulations: This paper proposes Adaptive Online Emulation (AOE), a framework that dynamically trains an ELM-based neural network surrogate model during physical simulation execution to replace expensive computational components, requiring no offline pretraining. On an exoplanetary atmospheric simulation, AOE achieves an 11.1× speedup (91% time savings) with only ~0.01% accuracy loss.
ControlFusion: A Controllable Image Fusion Framework with Language-Vision Degradation Prompts: This paper proposes ControlFusion, a controllable infrared-visible image fusion framework based on language-vision degradation prompts. It employs a physics-driven degradation imaging model to simulate compound degradations, and uses a prompt-modulated network to perform dynamic restoration and fusion, achieving comprehensive state-of-the-art performance under both real-world and compound degradation scenarios.
Predicting Public Health Impacts of Electricity Usage: This paper proposes HealthPredictor, an AI pipeline that maps electricity consumption end-to-end to public health damages (measured in $/MWh), comprising three modules: fuel mix prediction, air quality conversion, and health impact assessment. Health-driven optimization significantly reduces health impact prediction error compared to fuel-mix-driven baselines, and achieves a 24–42% reduction in health damages in an EV charging scheduling case study.
Reasoning With a Star: A Heliophysics Dataset and Benchmark for Agentic Scientific Reasoning: This paper introduces Reasoning With a Star (RWS), a 158-question scientific reasoning benchmark derived from NASA heliophysics summer school problem sets, spanning three answer types (numeric/symbolic/textual). Paired with a unit-aware grader, it evaluates four multi-agent coordination paradigms (HMAW/PACE/PHASE/SCHEMA) and finds that no single paradigm dominates across all tasks — the systems-engineering-inspired SCHEMA achieves the strongest performance on tasks requiring rigorous constraint validation.

🧠 Mixture of Experts

MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding: This paper proposes MoRE-Brain, a neuroscience-inspired fMRI visual decoding framework that employs a hierarchical Mixture-of-Experts (MoE) architecture to simulate the specialized processing of the brain's visual pathway. Combined with a dynamic temporal-spatial dual-routing mechanism that guides image generation via a diffusion model, MoRE-Brain achieves high-fidelity reconstruction while enabling efficient cross-subject generalization and unprecedented mechanistic interpretability.

📂 Others

3DID: Direct 3D Inverse Design for Aerodynamics with Physics-Aware Optimization: This paper proposes the 3DID framework, which learns a unified physics-geometry triplane latent representation, performs objective-gradient-guided diffusion sampling, and applies a two-stage topology-preserving refinement strategy to conduct inverse design directly in the full 3D space starting from random noise. On vehicle aerodynamic shape optimization, 3DID reduces simulated drag (Sim-Drag) by 13.6% compared to the best baseline.
4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos: This paper proposes 4DGT — a 4D Gaussian-based Transformer model trained entirely on real-world monocular posed videos that reconstructs dynamic scenes in seconds via feed-forward inference, significantly outperforming comparable feed-forward networks while achieving accuracy on par with optimization-based methods.
A Differentiable Model of Supply-Chain Shocks: This paper presents a JAX-based differentiable Agent-Based Model (ABM) of supply chains (~1,000 firms) that combines GPU parallelization and automatic differentiation to achieve Bayesian parameter calibration three orders of magnitude faster than conventional approximate Bayesian computation (ABC), paving the way for shock-propagation modeling in global supply-chain networks.
A Generalized Label Shift Perspective for Cross-Domain Gaze Estimation: This paper formulates cross-domain gaze estimation (CDGE) as a generalized label shift (GLS) problem, demonstrating that existing domain-invariant representation learning methods are theoretically insufficient under label shift. It proposes continuous importance reweighting based on truncated Gaussian distributions and a Probability-aware Conditional Operator Discrepancy (PCOD) to jointly correct label shift and conditional shift, achieving an average error reduction of 12%–27% across multiple backbones.
A Sustainable AI Economy Needs Data Deals That Work for Generators: This paper introduces the concept of the "Economic Data Processing Inequality" — in the ML value chain, data progresses from raw form to model weights to synthetic outputs, with each step refining technical signals while systematically stripping economic rights from data generators. The authors empirically validate this phenomenon through analysis of 73 publicly available data transactions, diagnose three structural deficiencies (missing provenance, asymmetric bargaining power, non-dynamic pricing), and propose the EDVEX framework as a solution blueprint.
A Theoretical Framework for Grokking: Interpolation followed by Riemannian Norm Minimisation: This paper rigorously proves the mechanism behind grokking from a purely optimization-theoretic perspective. Gradient flow with small weight decay exhibits two-phase dynamics in the $\lambda\to 0$ limit: rapid convergence to the critical manifold $\mathcal{M}$ of the training loss, followed by a Riemannian gradient flow along the manifold minimizing the $\ell_2$ norm at timescale $t\approx 1/\lambda$, thereby inducing delayed generalization.
A Unified Framework for Variable Selection in Model-Based Clustering with Missing Not at Random: Within a Gaussian mixture model clustering framework, this paper jointly addresses variable selection (distinguishing signal, redundant, and noise variables) and MNAR missing data modeling. A two-stage strategy—LASSO-penalized ranking followed by BIC-based role assignment—combined with spectral-distance adaptive penalty weights enables efficient inference in high-dimensional settings. Identifiability and asymptotic selection consistency are established theoretically.
Active Measurement: Efficient Estimation at Scale: This paper proposes the Active Measurement framework, which uses AI model predictions as an importance sampling proposal distribution and achieves unbiased estimation of scientific aggregate quantities through iterative human annotation and model updates, complemented by a novel combination weighting scheme and a conditional variance estimator for constructing reliable confidence intervals.
AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking: AcuRank maintains a probability distribution over document relevance via a Bayesian TrueSkill model, and at each iteration selectively reranks only documents whose positions remain uncertain. This yields a reranking framework that adaptively allocates computation according to query difficulty, surpassing fixed-computation baselines on multiple benchmarks with fewer LLM calls.
Adaptive Data Analysis for Growing Data: This paper establishes the first generalization bounds for adaptive analysis over dynamically growing data, permitting analysts to schedule queries adaptively based on current dataset size, and achieving increasingly tight guarantees as data accumulates via time-varying empirical accuracy bounds and differential privacy mechanisms.
Addressing Mark Imbalance in Integration-free Neural Marked Temporal Point Processes: This paper is the first to systematically reveal the severe impact of mark distribution imbalance on prediction performance in marked temporal point processes (MTPP). It proposes a mark-first-then-time prediction strategy, designs a thresholding method to calibrate the predicted probabilities of rare marks, and develops the integration-free IFNMTPP model to efficiently support mark probability estimation and time sampling.
Adjoint Schrödinger Bridge Sampler: This paper proposes the Adjoint Schrödinger Bridge Sampler (ASBS), which reinterprets the Schrödinger Bridge problem as a stochastic optimal control (SOC) problem. This eliminates the memoryless condition required by prior diffusion samplers, supports arbitrary source distributions (e.g., Gaussian, harmonic priors), and employs a scalable matching objective without importance weight estimation. ASBS consistently outperforms prior methods on multi-particle energy functions and molecular conformation generation.
Adjusted Count Quantification Learning on Graphs: This paper extends the classical Adjusted Classify & Count (ACC) quantification method to graph-structured data, proposing two techniques — Structural Importance Sampling (SIS) and Neighborhood-aware ACC (N-ACC) — to address structural covariate shift and non-homophilous edges in graph quantification, respectively.
ADPretrain: Advancing Industrial Anomaly Detection via Anomaly Representation Pretraining: This work presents ADPretrain, the first dedicated representation pretraining framework for industrial anomaly detection. By learning residual feature representations via angle-oriented and norm-oriented contrastive losses on the large-scale RealIAD dataset, the pretrained features consistently improve five mainstream embedding-based AD methods across five datasets and five backbone networks when substituted for the original features.
Alias-Free ViT: Fractional Shift Invariance via Linear Attention: This paper proposes the Alias-Free Vision Transformer (AFT), which combines anti-aliasing signal processing techniques with shift-equivariant linear cross-covariance attention, achieving near-perfect consistency (~99%) under fractional (sub-pixel) shifts for the first time, with negligible degradation in ImageNet classification accuracy.
An Empirical Investigation of Neural ODEs and Symbolic Regression for Dynamical Systems: This paper systematically investigates the extrapolation capability of Neural ODEs (NODEs) on noisy synthetic data, and explores a pipeline that employs NODEs as a data augmentation tool combined with symbolic regression (SR) to recover governing equations from limited data. Results demonstrate that this combined approach can recover two of three governing equations—and a strong approximation of the third—using only 10% of the simulation data.
EPHAD: An Evidence-Based Post-Hoc Adjustment Framework for Anomaly Detection Under Data Contamination: EPHAD is a test-time post-processing framework that corrects the output of anomaly detection models trained on contaminated data via Bayesian-style fusion with external evidence (e.g., CLIP, LOF) through exponential tilting. The framework requires no access to the training pipeline and consistently improves detection performance of contaminated models across 8 visual and 26 tabular AD datasets.
Are Pixel-Wise Metrics Reliable for Sparse-View Computed Tomography Reconstruction?: This paper reveals that pixel-level metrics such as PSNR and SSIM fail to capture anatomical structural completeness in sparse-view CT reconstruction (correlation only 0.16–0.30), and proposes anatomy-aware metrics (NSD/clDice) based on automated segmentation alongside the CARE framework—which incorporates segmentation-guided loss into diffusion model training—achieving 32% improvement in structural completeness for large organs and 36% for vessels.
AutoSciDACT: Automated Scientific Discovery through Contrastive Embedding and Hypothesis Testing: This work proposes the AutoSciDACT pipeline, which first employs supervised contrastive learning to compress high-dimensional scientific data into a 4-dimensional embedding space, then applies NPLM (New Physics Learning Machine) likelihood-ratio testing to statistically quantify distributional deviations in the embedding space. The pipeline achieves $\geq 3\sigma$ discovery at signal injection ratios of $\leq 1\%$ across astronomical, particle physics, pathology, image, and synthetic datasets.
Brain-Like Processing Pathways Form in Models With Heterogeneous Experts: Heterogeneous experts in Mixture-of-Experts models do not spontaneously form processing pathways. This paper proposes three brain-inspired inductive biases — routing cost, task-performance scaling, and expert dropout — that enable the model to develop a Mixture-of-Pathways architecture analogous to the brain's dynamic cortical–subcortical pathways.
Computable Universal Online Learning: This paper introduces computability constraints into the universal online learning framework, proving that "mathematically learnable" does not imply "learnable by a computer program," and provides precise characterizations of computable learning under both agnostic and proper variants.
ConTextTab: A Semantics-Aware Tabular In-Context Learner: ConTextTab integrates semantic embeddings (text encodings of column names and categorical values) into a table-native ICL architecture, and pretrains on large-scale real-world tabular data (T4, ~2.18M tables). It achieves a new SOTA on the semantics-rich CARTE benchmark while remaining competitive with existing methods on non-semantic benchmarks.
Contextual Dynamic Pricing with Heterogeneous Buyers: This paper presents the first systematic study of contextual dynamic pricing with heterogeneous buyers of $K_\star$ unknown types. It proposes an Optimistic Posterior Sampling (OPS)-based algorithm achieving an $\tilde{O}(K_\star\sqrt{dT})$ regret bound (optimal in $d$ and $T$), and further introduces ZoomV—a variance-aware adaptive discretization algorithm—achieving the optimal $\tilde{O}(\sqrt{K_\star T})$ regret in the non-contextual setting.
Continuous Thought Machines: This paper proposes the Continuous Thought Machine (CTM), which generates neuron-level temporal dynamics via privately parameterized Neuron-Level Models (NLMs) and employs a neural synchrony matrix as the core latent representation. The model demonstrates complex reasoning, adaptive computation, and interpretable attention behavior on tasks including maze solving, ImageNet classification, and parity checking.
Coreset for Robust Geometric Median: Eliminating Size Dependency on Outliers: This paper is the first to eliminate the dependency of the robust geometric median coreset size on the number of outliers $m$: under the condition $n \geq 4m$, it achieves an optimal coreset size of $\tilde{\Theta}(\varepsilon^{-1/2} + \frac{m}{n}\varepsilon^{-1})$ for $d=1$, and $\tilde{O}(\varepsilon^{-2}\min\{\varepsilon^{-2}, d\})$ in high dimensions. The core technical contribution is a novel non-componentwise error analysis.
Coresets for Clustering Under Stochastic Noise: This paper presents the first systematic study of $(k,z)$-clustering coreset construction under noisy data. It proposes a novel surrogate error metric $\mathsf{Err}_\alpha$ to replace the traditional $\mathsf{Err}$, achieving a $\text{poly}(k)$-fold reduction in coreset size and a $\text{poly}(k)$-fold tightening of quality guarantees under mild data assumptions, along with a noise-aware cluster-wise sampling algorithm.
Deep Continuous-Time State-Space Models for Marked Event Sequences: S2P2 unifies linear Hawkes processes with deep state space models by stacking multiple latent linear Hawkes (LLH) layers with nonlinear activations, yielding a highly expressive continuous-time MTPP model. It leverages parallel scanning to achieve linear complexity and sub-linear runtime, improving predictive likelihood by an average of 33% across 8 real-world datasets.
Deep Learning for Continuous-Time Stochastic Control with Jumps: Two model-based deep learning algorithms (GPI-PINN and GPI-CBU) are proposed to solve finite-horizon continuous-time stochastic control problems with jumps. By iteratively training a policy network and a value network, the approach avoids discretization and simulation of state dynamics, and demonstrates strong performance in high-dimensional settings.
Deep Legendre Transform: DLT exploits the implicit Fenchel representation of convex conjugates, $f^*(\nabla f(x)) = \langle x, \nabla f(x) \rangle - f(x)$, to reformulate conjugate computation as a standard regression problem, thereby avoiding max/min-max optimization. The method also admits a posteriori error estimation, and when combined with KAN, yields exact closed-form solutions.
Dense Associative Memory with Epanechnikov Energy: This paper proposes a log-sum-ReLU (LSR) energy function based on the Epanechnikov kernel as a replacement for the conventional log-sum-exp (LSE) energy in Dense Associative Memory. For the first time, it achieves the coexistence of exact retrieval of all stored patterns and the emergence of novel creative local minima, while preserving exponential memory capacity.
Depth-Bounds for Neural Networks via the Braid Arrangement: This paper proves that, under $\mathcal{B}_d^0$-conforming constraints, exactly representing $\max\{0, x_1, \ldots, x_d\}$ with a ReLU network requires $\Omega(\log \log d)$ layers—the first non-constant depth lower bound without weight restrictions. It also shows that rank-(3,2) maxout networks can compute the maximum of 7 values, demonstrating that the standard upper bound is not tight.
Depth-Supervised Fusion Network for Seamless-Free Image Stitching: DSFN is a seamless image stitching method with depth consistency constraints: a depth-aware two-stage transformation estimation addresses large-parallax alignment, soft-seam region diffusion enables natural blending, and a re-parameterization strategy improves efficiency. The method comprehensively surpasses the state of the art on the UDIS-D and IVSD datasets.
Directional Non-Commutative Monoidal Structures for Compositional Embeddings in Machine Learning: This paper proposes an algebraic framework based on directional non-commutative monoid operators, providing a unified mathematical foundation for multi-dimensional compositional embeddings and unifying SSM recurrence, Transformer self-attention, and RoPE positional encoding as special cases.
Distributionally Robust Feature Selection: This paper proposes a model-agnostic distributionally robust feature selection method that achieves a continuous relaxation of discrete selection by injecting controlled Gaussian noise into covariates, and optimizes the conditional variance of the Bayes-optimal predictor, so that the selected feature subset enables high-quality downstream models to be trained simultaneously across multiple subpopulations.
Double Descent Meets Out-of-Distribution Detection: Theoretical Insights and Empirical Analysis: This paper is the first to reveal a double descent phenomenon in post-hoc OOD detection—OOD detection performance exhibits a valley near the interpolation threshold as model width increases, then recovers—provides a theoretical explanation via random matrix theory, and proposes an NC1 criterion based on Neural Collapse to identify the optimal model complexity regime.
DPA: A One-Stop Metric to Measure Bias Amplification in Classification Datasets: This paper proposes Directional Predictability Amplification (DPA), a predictability-based metric for measuring bias amplification. It is the only one-stop metric that simultaneously satisfies directionality, applicability to both balanced and imbalanced datasets, and correct identification of positive and negative bias amplification, by measuring the relative change between model bias and dataset bias.
Efficient Kernelized Learning in Polyhedral Games Beyond Full-Information: From Colonel Blotto to Congestion Games: This paper proposes a kernelization-based framework for designing computationally efficient no-regret learning algorithms for polyhedral games (Colonel Blotto, graphic matroid congestion games, and network congestion games) under partial-information feedback, significantly improving the runtime complexity for learning coarse correlated equilibria (CCE).
Efficient Parametric SVD of Koopman Operator for Stochastic Dynamical Systems: This paper proposes a low-rank approximation (LoRA)-based objective to learn the top-k singular functions of the Koopman operator for stochastic dynamical systems, entirely avoiding the numerically unstable matrix decomposition operations present in VAMPnet/DPNet, with naturally unbiased gradients.
Emergency Response Measures for Catastrophic AI Risk: This paper systematically analyzes how Frontier Safety Policies (FSPs) can be integrated into the first two stages of China's four-phase emergency response framework (prevention–early warning–response–recovery), employing dangerous capability evaluations, tiered thresholds, and pre-established safety measures to address catastrophic AI risks. The analysis is further contextualized through comparisons with international practices such as the EU AI Act and California SB 53.
Evolutionary Learning in Spatial Agent-Based Models for Physical Climate Risk Assessment: This paper proposes an Agent-Based Model (ABM) that integrates geospatial climate hazard data with evolutionary learning mechanisms. Using a simplified economic network comprising a three-tier commodity–manufacturing–retail supply chain, the model simulates economic responses from 2025 to 2100 under RCP8.5 flood projections. Results demonstrate that evolutionary adaptation enables firms to maintain significantly higher levels of production, capital, liquidity, and employment under climate stress, while revealing supply chain systemic risks that traditional asset-level assessments fail to capture.
Evolutionary Prediction Games: This paper proposes the "Evolutionary Prediction Games" framework, applying evolutionary game theory to analyze feedback loops between prediction algorithms and user populations. It shows that ideal learners lead to competitive exclusion (survival of the fittest), whereas practical learners (with finite data, surrogate losses, or overparameterization) can instead foster stable coexistence and mutualism among groups.
Exact Learning of Arithmetic with Differentiable Agents: This paper proposes the Differentiable Finite-State Transducer (DFST), a Turing-complete and end-to-end differentiable model family that operates on a 2D symbol grid. Trained via Policy-Trajectory Observations (PTOs) derived from expert arithmetic computations, DFST achieves perfect generalization to 3,850-digit binary addition and 2,450-digit decimal addition using only 20 training samples of up to 3-digit addition, with zero observed errors.
FACE: Faithful Automatic Concept Extraction: This paper proposes FACE, a framework that incorporates a KL divergence regularization term into non-negative matrix factorization (NMF) to constrain reconstructed activations to remain consistent with the original model's predictions, thereby extracting concept explanations that are truly faithful to the model's decision process. FACE comprehensively outperforms CRAFT and ICE on ImageNet, COCO, and CelebA.
Faithful Group Shapley Value: This paper proposes the Faithful Group Shapley Value (FGSV), the unique group-level data valuation method satisfying five axioms including "faithfulness," which effectively defends against the "shell company attack" (artificially inflating valuation by splitting subgroups), and introduces an efficient approximation algorithm with $O(n \cdot \text{Poly}(\log n))$ complexity.
Finite-Time Analysis of Stochastic Nonconvex Nonsmooth Optimization on the Riemannian Manifolds: This paper proposes the Riemannian Online to NonConvex (RO2NC) algorithm and its zeroth-order variant ZO-RO2NC, establishing for the first time a finite-time sample complexity guarantee of $O(\delta^{-1}\epsilon^{-3})$ for fully nonsmooth nonconvex stochastic optimization on Riemannian manifolds, matching the optimal result in Euclidean space.
FlashMD: Long-Stride, Universal Prediction of Molecular Dynamics: FlashMD is proposed as a GNN-based framework that directly predicts the positional and momentum evolution of molecular dynamics trajectories with long strides, achieving time steps 1–2 orders of magnitude larger than those of conventional MD integrators. The architecture incorporates Hamiltonian dynamics constraints and generalizes to arbitrary thermodynamic ensembles and universal chemical systems.
FlowMoE: A Scalable Pipeline Scheduling Framework for Distributed MoE Training: FlowMoE proposes a unified pipeline scheduling framework that integrates MHA computation, gating, expert computation, and A2A communication into a single pipeline. A priority-driven all-reduce tensor chunking mechanism maximizes communication–computation overlap, achieving 1.13×–1.82× speedup, 10–39% energy reduction, and 7–32% memory savings across multiple real-world MoE models.
Fostering the Ecosystem of AI for Social Impact Requires Expanding and Strengthening Evaluation Standards: This paper argues that the academic ecosystem of AI for Social Impact (AISI) requires a dual-track reform: broadening the definition of "impact" to recognize contributions beyond deployment or methodological novelty, while simultaneously demanding causal-inference-level rigor in evaluating deployed systems.
Frequency-Aware Token Reduction for Efficient Vision Transformer: This paper proposes frequency-aware token reduction from a frequency-domain perspective, partitioning tokens into high-frequency (HF) and low-frequency (LF) groups. HF tokens are selectively retained while LF tokens are aggregated into DC tokens, simultaneously alleviating rank collapse and reducing computational cost in ViTs. The method outperforms existing state-of-the-art token reduction methods across multiple models at a 30% token reduction ratio.
FSNet: Feasibility-Seeking Neural Network for Constrained Optimization with Guarantees: This paper proposes FSNet, a framework that integrates differentiable feasibility-seeking steps into neural networks. By minimizing constraint violations via unconstrained optimization, FSNet guarantees constraint satisfaction while supporting end-to-end training. It significantly outperforms traditional solvers in speed across convex/non-convex and smooth/non-smooth problems while maintaining feasibility.
Gaussian Process Upper Confidence Bound Achieves Nearly-Optimal Regret in Noise-Free Gaussian Process Bandits: This paper proves that the GP-UCB algorithm achieves nearly-optimal regret bounds in noise-free GP bandit problems, including the first $O(1)$ constant cumulative regret under the SE kernel and the Matérn kernel (with $d > \nu$), thereby closing the long-standing gap between the theory of GP-UCB and its empirical performance.
Generalized Linear Mode Connectivity for Transformers: This paper proposes a unified symmetry framework (a four-level hierarchy of permutation, semi-permutation, orthogonal, and invertible transformations) to achieve zero or near-zero barrier linear mode connectivity (LMC) on Vision Transformers and GPT-2 for the first time, and further extends the framework to multi-model merging and heterogeneous-width alignment.
Graph Alignment via Birkhoff Relaxation: This paper provides the first theoretical guarantees for the Birkhoff relaxation of the graph alignment problem (relaxing the permutation matrix constraint to doubly stochastic matrices), proving a phase transition in the Gaussian Wigner model: when $\sigma = o(n^{-1})$, the relaxed solution approximates the true permutation; when $\sigma = \Omega(n^{-0.5})$, the relaxed solution is far from the true permutation.
Harnessing Feature Resonance under Arbitrary Target Alignment for Out-of-Distribution Node Detection: This paper discovers the Feature Resonance phenomenon—when optimizing the representations of known in-distribution (ID) nodes, unknown ID nodes undergo significantly larger representational changes than OOD nodes, and this phenomenon is label-agnostic. Based on this observation, the authors propose RSL, a graph OOD node detection framework that requires no multi-class labels, achieving state-of-the-art performance across 13 datasets.
Hessian-guided Perturbed Wasserstein Gradient Flows for Escaping Saddle Points: This paper proposes the Perturbed Wasserstein Gradient Flow (PWGF) algorithm, which injects noise perturbations via Hessian-guided Gaussian processes to enable efficient saddle point escape and second-order optimality in probability measure optimization.
How Many Domains Suffice for Domain Generalization? A Tight Characterization via the Domain Shattering Dimension: This paper introduces the Domain Shattering Dimension (Gdim), a novel combinatorial measure that tightly characterizes the number of domains required for domain generalization (i.e., the domain sample complexity), and establishes its relationship to the classical VC dimension as $\Theta(d \log(1/\alpha))$.
Hybrid-Balance GFlowNet for Solving Vehicle Routing Problems: This paper proposes the Hybrid-Balance GFlowNet (HBG) framework, which for the first time introduces Detailed Balance (DB) in the VRP setting and unifies it with Trajectory Balance (TB), along with a depot-guided inference strategy. HBG consistently and significantly improves two existing GFlowNet-based solvers (AGFN and GFACS) on CVRP and TSP benchmarks.
Impact of Layer Norm on Memorization and Generalization in Transformers: This work systematically reveals the fundamentally distinct roles of LayerNorm in Pre-LN and Post-LN Transformers: in Pre-LN, LN is essential for learning and its removal disrupts generalization; in Post-LN, LN drives memorization and its removal suppresses memorization while recovering true labels.
Improved Approximation Algorithms for Chromatic and Pseudometric-Weighted Correlation Clustering: For two important generalizations of Correlation Clustering—Chromatic CC and pseudometric-weighted CC—this paper achieves a 2.15-approximation and a tight 10/3-approximation, respectively, via LP relaxation and carefully designed rounding functions, significantly improving upon the previous best results of 2.5 and 6.
Improving Decision Trees through the Lens of Parameterized Local Search: This paper analyzes local search operations for decision tree optimization through the lens of parameterized complexity, identifies the sources of computational hardness, and proves that the combination of the number of features and domain size yields fixed-parameter tractability (FPT), accompanied by a proof-of-concept implementation.
Improving Forecasts of Suicide Attempts for Patients with Little Data: This paper proposes the Latent Similarity Gaussian Process (LSGP), which embeds patients into a continuous latent space to capture heterogeneity, enabling data-scarce patients to "borrow" predictive trends from similar patients, thereby improving suicide attempt prediction based on EMA data.
Inferring Stochastic Dynamics with Growth from Cross-Sectional Data: This paper proposes Unbalanced Probabilistic Flow Inference (UPFI), which jointly infers the drift, diffusion, and growth rate of stochastic dynamical systems from cross-sectional data via a Lagrangian formulation of the Fokker-Planck equation, constituting the first method to accurately handle scenarios involving cell proliferation and death.
Information-Computation Tradeoffs for Noiseless Linear Regression with Oblivious Contamination: For noiseless linear regression under the oblivious contamination model, this paper formally proves that any efficient Statistical Query algorithm requires VSTAT complexity at least $\tilde{\Omega}(d^{1/2}/\alpha^2)$, providing evidence that the quadratic dependence on $1/\alpha$ constitutes an essential computational lower bound for efficient algorithms.
Infrequent Exploration in Linear Bandits: This paper proposes the INFEX framework, which executes a baseline algorithm (e.g., LinUCB/LinTS) at designated exploration steps according to a given schedule and selects arms greedily at all other time steps. It proves that as long as the number of exploration steps grows as $\omega(\log T)$, INFEX achieves the same poly-logarithmic regret as full-time exploration while substantially reducing computational overhead (80%–99% of time steps are greedy).
Johnson-Lindenstrauss Lemma Beyond Euclidean Geometry: This paper extends the Johnson-Lindenstrauss (JL) lemma from Euclidean space to general symmetric hollow dissimilarity matrices, proposing two complementary approaches — pseudo-Euclidean JL and generalized power distance JL — where the approximation error scales proportionally with the degree of deviation from Euclidean geometry.
Kernel Conditional Tests from Learning-Theoretic Bounds: A unified framework is proposed for converting confidence bounds of learning algorithms into conditional hypothesis tests. Built upon kernel ridge regression, the framework yields conditional two-sample tests with finite-sample guarantees and, for the first time, supports non-i.i.d. data and online sampling scenarios.
Lagrangian neural ODEs: Measuring the existence of a Lagrangian with Helmholtz metrics: This paper proposes Helmholtz metrics — differentiable metrics derived from the Helmholtz conditions — to quantify how closely a given ODE approximates the Euler-Lagrange equations. These metrics are incorporated as regularization terms into second-order Neural ODE training, forming Lagrangian Neural ODEs that guide the model toward true physical laws with zero additional inference overhead.
Learning-Augmented Online Bipartite Fractional Matching: This paper proposes two learning-augmented algorithms (LAB and PAW) for online bipartite fractional matching. Given a potentially inaccurate advice matching, both algorithms Pareto-dominate the naïve CoinFlip strategy across the entire robustness spectrum for the first time.
Learning-Augmented Streaming Algorithms for Correlation Clustering: This paper proposes the first learning-augmented streaming algorithms for Correlation Clustering. By leveraging pairwise distance predictions, the proposed methods achieve a better-than-3 approximation ratio on complete graphs ($\tilde{O}(n)$ space) and an $O(\log|E^-|)$ approximation ratio on general graphs ($\tilde{O}(n)$ space), significantly improving the space–approximation tradeoff over existing prediction-free algorithms.
Learning (Approximately) Equivariant Networks via Constrained Optimization: This paper proposes ACE (Adaptive Constrained Equivariance), a framework that formulates equivariant neural network training as a constrained optimization problem. Via dual methods, ACE automatically and progressively transitions from a flexible non-equivariant model to an equivariant one, adapting to both fully and partially symmetric data without manual hyperparameter tuning.
Learning Dense Hand Contact Estimation from Imbalanced Data: This paper proposes the HACO framework, which addresses class imbalance via Balanced Contact Sampling (BCS) and spatial imbalance via a Vertex-level Class-Balanced Loss (VCB Loss). HACO is the first dense hand contact estimation model trained across 14 datasets (655K images) and achieves state-of-the-art performance across diverse interaction scenarios.
Learning Dynamics of RNNs in Closed-Loop Environments: This paper establishes a mathematical theory revealing that RNNs exhibit fundamentally different learning dynamics under closed-loop (agent–environment interaction) versus open-loop (supervised learning) training. Closed-loop learning follows a three-phase process driven by the competition between short-term policy improvement and long-term stability.
Learning non-equilibrium diffusions with Schrödinger bridges: from exactly solvable to simulation-free: This paper generalizes the Schrödinger bridge problem (SBP) from Brownian motion reference processes to multivariate Ornstein-Uhlenbeck (mvOU) reference processes, derives exact solutions for the Gaussian case, and proposes the simulation-free mvOU-OTFM algorithm for general distributions.
Learning to Condition: A Neural Heuristic for Scalable MPE Inference: This paper proposes Learning to Condition (L2C), which trains an attention network to learn dual scores — optimality and simplification — for variable-value pairs from solver search trajectories, guiding conditioning decisions in MPE inference over probabilistic graphical models (PGMs). L2C substantially reduces the search space on high-treewidth models while maintaining or improving solution quality.
Look-Ahead Reasoning on Learning Platforms: This paper formalizes level-$k$ look-ahead reasoning in user–algorithm interactions on learning platforms. It proves that individually selfish higher-order reasoning only accelerates convergence without altering the equilibrium (i.e., no long-term gain), while the benefit of collective coordination is determined by the alignment between the learner's and users' utility functions. A theoretical framework is provided to characterize upper bounds on coordination gains.
MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision: MAS-ZERO is the first inference-time automatic MAS design framework. Through a meta-agent that iteratively designs, critiques, and refines MAS configurations (including task decomposition and sub-MAS assignment), it requires no validation set or training, and outperforms both manual and automatic MAS baselines on reasoning (+16.69%), programming (+16.66%), and search agent (+5.45%) tasks while maintaining a Pareto-optimal accuracy–cost trade-off.
MaxSup: Overcoming Representation Collapse in Label Smoothing: By decomposing the loss function of Label Smoothing (LS), this paper identifies an "error amplification term" that exacerbates misclassification, leading to intra-class feature collapse. The proposed Max Suppression (MaxSup) method redirects the penalty target from the ground-truth logit to the top-1 logit, eliminating the error amplification effect while preserving beneficial regularization.
MEGState: Phoneme Decoding from Magnetoencephalography Signals: This paper proposes MEGState, an architecture combining multi-resolution convolution and sensor-wise state space models (SSMs) for decoding phonemes from magnetoencephalography (MEG) signals, achieving substantial improvements over baseline methods on the LibriBrain dataset.
Meta-learning three-factor plasticity rules for structured credit assignment with sparse feedback: This paper proposes a meta-learning framework that automatically discovers local neo-Hebbian synaptic plasticity rules via outer-loop gradient optimization, enabling recurrent neural networks to perform structured credit assignment using only sparse, delayed reward signals, thereby providing new insights into the learning mechanisms of biological neural networks.
MetaFind: Scene-Aware 3D Asset Retrieval for Coherent Metaverse Scene Generation: MetaFind is a scene-aware tri-modal (text + image + point cloud) 3D asset retrieval framework that encodes scene layout information via an SE(3)-equivariant spatial-semantic graph neural network (ESSGNN), enabling iterative asset retrieval with style consistency and spatial coherence for metaverse scene generation.
MiCADangelo: Fine-Grained Reconstruction of Constrained CAD Models from 3D Scans: MiCADangelo emulates the reverse engineering workflow of human CAD designers: it extracts 2D patterns via multi-plane cross-section analysis, predicts constrained parametric sketches, and optimizes extrusion parameters, achieving for the first time complete parametric model reconstruction with sketch constraints in 3D CAD reverse engineering.
Military AI Needs Technically-Informed Regulation to Safeguard AI Research and its Applications: This paper proposes a behavior-oriented definition and regulatory framework for AI-powered Lethal Autonomous Weapons Systems (AI-LAWS). It identifies systems requiring enhanced regulation through two technical criteria, puts forward five concrete policy recommendations, and calls on AI researchers to participate actively throughout the full lifecycle of military AI governance.
Modeling Cell Dynamics and Interactions with Unbalanced Mean Field Schrödinger Bridge: This paper proposes the Unbalanced Mean Field Schrödinger Bridge (UMFSB) framework and the CytoBridge deep learning algorithm, which simultaneously model unbalanced stochastic cell dynamics and cell-cell interactions from sparse temporal snapshot data.
Modeling Neural Activity with Conditionally Linear Dynamical Systems: This paper proposes Conditionally Linear Dynamical Systems (CLDS), where Gaussian process priors allow the parameters of a linear dynamical system to vary nonlinearly as a function of observed experimental covariates, preserving the interpretability and efficient inference of linear models while capturing the nonlinear dynamics prevalent in neural circuits.
MutualVPR: A Mutual Learning Framework for Resolving Supervision Inconsistencies via Adaptive Clustering: This paper proposes MutualVPR, a mutual learning framework that dynamically assigns scene category labels through feature-driven adaptive K-means clustering, addressing the supervision inconsistency problem in classification-based VPR methods caused by viewpoint variation and occlusion.
Neural Collapse in Cumulative Link Models for Ordinal Regression: An Analysis with Unconstrained Feature Model: This paper extends Neural Collapse (NC) theory to ordinal regression (OR) tasks based on cumulative link models (CLM). Under the unconstrained feature model (UFM) framework, three hallmark properties of Ordinal Neural Collapse (ONC) are formally proven: within-class mean collapse (ONC1), feature collapse onto a one-dimensional subspace (ONC2), and ordered arrangement of latent variables by class (ONC3). In the zero-regularization limit, a concise geometric relationship between latent variables and thresholds is revealed.
Neural Network for Simulating Radio Emission from Extensive Air Showers: A simple fully connected neural network is employed to replace computationally expensive CoREAS Monte Carlo simulations, enabling fast prediction of radio pulses from extensive air showers (EAS) while achieving $X_{\text{max}}$ reconstruction resolution comparable to conventional simulations.
Non-Clairvoyant Scheduling with Progress Bars: This paper introduces a "progress bar" information model as an interpolation framework between clairvoyant and non-clairvoyant scheduling. It designs scheduling algorithms with optimal consistency–robustness tradeoffs for both adversarial and stochastic progress bars, while advancing the theoretical frontier of learning-augmented scheduling.
Nonlinearly Preconditioned Gradient Methods: Momentum and Stochastic Analysis: Under the anisotropic descent inequality framework, this paper introduces heavy ball momentum into nonlinearly preconditioned gradient methods and analyzes the convergence properties of their stochastic variants under multiple noise assumptions, thereby unifying the theoretical analysis of gradient clipping and normalized gradient methods.
Normalization in Attention Dynamics: This paper unifies various normalization schemes (Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT, sqrt-scaling) under a single framework of velocity modulation in an interacting particle system on the sphere. It theoretically characterizes how each scheme affects token clustering dynamics and representation collapse, identifying Peri-LN as the theoretically optimal choice.
Obliviator Reveals the Cost of Nonlinear Guardedness in Concept Erasure: This paper proposes Obliviator — a post-processing concept erasure method based on HSIC minimization in RKHS — that iteratively deforms the feature space through a two-step optimization procedure. It is the first method to achieve complete guardedness against nonlinear adversaries, while quantifying the utility-erasure trade-off of nonlinear guardedness. Obliviator substantially outperforms existing methods across multiple PLMs and datasets.
On a Geometry of Interbrain Networks: This opinion piece proposes introducing discrete graph curvature (Forman-Ricci and Ollivier-Ricci curvature) into interbrain network analysis within hyperscanning research. It leverages the entropy of curvature distributions to detect network phase transitions and uses curvature values to infer interbrain information routing strategies, moving beyond the descriptive limitations of conventional correlation-based metrics.
On Agnostic PAC Learning in the Small Error Regime: In the small error regime of agnostic PAC learning ($\tau \approx d/m$), this paper constructs a computationally efficient learner based on ERM aggregation that achieves an error upper bound of $c \cdot \tau + O(\sqrt{\tau d/m} + d/m)$ with $c \leq 2.1$, matching known lower bounds and advancing the precise complexity characterization of agnostic learning.
On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling: This paper reveals that under standard parameterization (SP), the cross-entropy loss causes the previously monolithic "unstable" regime to split into two distinct sub-regimes: catastrophic instability and controlled divergence. In the controlled divergence regime ($\eta_n = \Theta(n^{-1/2})$), logits diverge while gradients and activations remain stable, thereby establishing the first practically useful infinite-width limit for SP that admits feature learning.
On Topological Descriptors for Graph Products: This paper systematically investigates the expressive power of topological descriptors — Euler Characteristic (EC) and Persistent Homology (PH) — computed on (box) products of graphs under various filtrations. It proves that PH descriptors on graph products are strictly more expressive than those computed on individual graphs, whereas EC does not enjoy this property, and proposes an efficient algorithm for computing PH on product graphs.
On Universality Classes of Equivariant Networks: This paper proves that the separation power of equivariant neural networks (i.e., their ability to distinguish symmetry-inequivalent inputs) is insufficient to fully characterize their approximation capacity—models with identical separation power may possess strictly different approximation abilities. The paper provides a complete characterization of universality classes for shallow invariant networks and establishes sufficient conditions for universality failure.
One Sample is Enough to Make Conformal Prediction Robust: This paper proposes RCP1 (Robust Conformal Prediction with One sample), which certifies the conformal procedure itself rather than individual conformity scores. Requiring only a single randomly perturbed forward pass at inference, RCP1 yields smaller robust prediction sets than state-of-the-art methods that require 100 forward passes.
Optimism Without Regularization: Constant Regret in Zero-Sum Games: This paper provides the first proof that Optimistic Fictitious Play without regularization achieves $O(1)$ constant regret in $2\times2$ zero-sum games, matching the optimal rate of regularized Optimistic FTRL. It further establishes an $\Omega(\sqrt{T})$ regret lower bound for Alternating Fictitious Play, separating the capabilities of optimism and alternation in the unregularized setting.
Optimized Learned Count-Min Sketch: This paper proposes OptLCMS, which partitions the score space and analytically solves CMS parameters via KKT conditions while optimizing thresholds through dynamic programming, substantially accelerating the construction process and providing theoretical guarantees on the probability of intolerable error.
OrbitZoo: Real Orbital Systems Challenges for Reinforcement Learning: This paper presents OrbitZoo, a multi-agent RL environment built on the industrial-grade astrodynamics library Orekit. It integrates high-fidelity orbital dynamics (including atmospheric drag, solar radiation pressure, and third-body effects), a PettingZoo multi-agent interface, and real-time 3D visualization. Validation against real Starlink ephemerides yields a mean MAPE of only 0.16%.
OrthoLoC: UAV 6-DoF Localization and Calibration Using Orthographic Geodata: OrthoLoC establishes the first large-scale UAV 6-DoF localization benchmark dataset based on orthographic geodata (DOP+DSM), comprising 16,425 real UAV images across 47 regions in Germany and the United States. It further introduces AdHoP (Adaptive Homography Preprocessing), a matching enhancement technique that improves matching performance by 95% and reduces translation error by 63% without modifying the underlying feature matcher.
Out-of-distribution Generalisation is Hard: Evidence from ARC-like Tasks: By constructing ARC-like tasks with well-defined OOD metrics, this paper demonstrates that standard neural networks (MLP/CNN/Transformer) fail to achieve compositional OOD generalisation. Moreover, even architectures designed with correct inductive biases that attain near-perfect OOD performance may still learn incorrect compositional features.
Overfitting in Adaptive Robust Optimization: This paper establishes an analogy between policy fragility in Adaptive Robust Optimization (ARO) and overfitting in machine learning: adaptive policies perform well within the uncertainty set but may fail outside it. The paper proposes constraint-specific uncertainty set sizing as a "regularization" mechanism to balance robustness and adaptability.
Plasticity as the Mirror of Empowerment: This paper proposes Generalized Directed Information (GDI) as an information-theoretic tool for measuring agent plasticity, revealing that plasticity is the "mirror" of empowerment — both use the same measure but in opposite directions — and proves a strict tension bound between the two.
Value-Guided Search for Efficient Chain-of-Thought Reasoning: This paper proposes Value-Guided Search (VGS), which employs a token-level value model to guide block-level beam search without requiring predefined "steps." VGS achieves a +14.5% relative accuracy improvement over majority voting on competition mathematics while reducing inference computation by 30%, outperforming existing PRM-based approaches.
Position: There Is No Free Bayesian Uncertainty Quantification: This paper challenges the validity of Bayesian uncertainty quantification (UQ) from a frequentist perspective, reinterprets Bayesian updating as an optimization problem over model ensembles, and proposes a PAC-framework-based calibration algorithm for constructing prediction intervals with frequentist guarantees.
Prediction-Powered Semi-Supervised Learning with Online Power Tuning: This paper extends the Prediction-Powered Inference (PPI) framework to the training phase of semi-supervised learning. It proposes an unbiased gradient estimator and designs an online AdaGrad algorithm to dynamically tune the interpolation parameter $\lambda$ between pseudo-labels and true labels, achieving convergence rates matching the optimal fixed $\lambda$ while maintaining unbiasedness.
Private Evolution Converges: This paper provides the first convergence guarantee for the Private Evolution (PE) synthetic data generation algorithm that does not rely on unrealistic assumptions, proving that under appropriate hyperparameter settings, the $(\epsilon,\delta)$-DP synthetic dataset output by PE achieves a 1-Wasserstein distance of $\tilde{O}(d(n\epsilon)^{-1/d})$.
Product Distribution Learning with Imperfect Advice: This paper studies the problem of learning product distributions over the Boolean hypercube given an imperfect advice distribution, and proposes an efficient algorithm that achieves sub-linear dependence on dimension $d$ in sample complexity when the advice is of sufficient quality.
Radar: Benchmarking Language Models on Imperfect Tabular Data: This paper introduces the Radar benchmark, which systematically evaluates language models' data-aware reasoning on imperfect tabular data by injecting five categories of data artifacts (missing values, bad values, outliers, formatting inconsistencies, and logical inconsistencies) into real-world tables. The benchmark reveals that even frontier models suffer substantial performance degradation upon the introduction of data artifacts.
Recurrent Self-Attention Dynamics: An Energy-Agnostic Perspective from Jacobians: This paper adopts a dynamical systems perspective grounded in Jacobian analysis to move beyond the symmetry constraints imposed by traditional energy-function frameworks. It reveals the critical role of normalization layers in suppressing the spectral norm and oscillatory components of self-attention, identifies that high-performing recurrent self-attention models exhibit Lyapunov exponents approaching zero (a criticality regime), and proposes a spectral regularization method that substantially improves inference performance.
Redundancy-Aware Test-Time Graph Out-of-Distribution Detection: This paper proposes RedOUT, a framework that constructs coding trees via structural entropy minimization to eliminate redundant information in graph structures. Combined with the Redundancy-aware Graph Information Bottleneck (ReGIB) principle, RedOUT effectively distinguishes in-distribution (ID) from out-of-distribution (OOD) graph samples at test time without modifying pretrained model parameters, achieving an average AUC of 87.46% across 10 dataset pairs.
Regression Trees Know Calculus: This paper shows that piecewise-constant regression trees carry latent gradient information: treating the difference in child-node means as a finite-difference analogue yields efficient gradient estimates. This imports differential tools such as Active Subspaces (AS) and Integrated Gradients (IG) into tree models, extending their interpretability and offering a route to better predictions.
Reliable Active Learning from Unreliable Labels via Neural Collapse Geometry: This paper proposes NCAL-R, which leverages the Neural Collapse (NC) geometry emerging in the terminal training phase of deep networks. Two scoring metrics—Class Mean Alignment Perturbation (CMAP) and Feature Fluctuation (FF)—are designed for sample selection, making active learning more reliable under label noise and distribution shift. The method consistently outperforms conventional AL baselines on ImageNet-100 and CIFAR-100.
Note 5: ReSearch — Learning to Reason with Search: ReSearch embeds search operations as first-class primitives within reasoning chains and leverages GRPO reinforcement learning to automatically learn when and how to search—without any supervision on intermediate reasoning steps—achieving an average relative improvement of 15.81% over baselines on multi-hop QA benchmarks.
ResNets Are Deeper Than You Think: This paper proves that residual networks and feedforward networks occupy distinct function spaces (i.e., ResNets are not a simple reparameterization of feedforward networks), and demonstrates through post-training partial linearization experiments that variable-depth architectures (ResNet-like) consistently outperform fixed-depth architectures even after controlling for trainability differences, suggesting that residual connections provide inductive biases beyond optimization.
Rethinking PCA Through Duality: This paper revisits PCA through the Difference-of-Convex (DC) framework, establishing kernelization and out-of-sample extension capabilities, revealing that simultaneous iteration is a special case of DCA, and proposing a kernelizable dual formulation for robust $\ell_1$-PCA.
Revisiting Agnostic Boosting: This paper proposes a new agnostic boosting algorithm that substantially improves the sample complexity of prior work under very general assumptions, and establishes nearly matching lower bounds, thereby resolving the sample complexity of agnostic boosting up to logarithmic factors.
RNNs Perform Task Computations by Dynamically Warping Neural Representations: This paper proposes a Riemannian geometric framework that pulls back the metric from the RNN state space onto the input manifold, demonstrating that RNNs perform computation by dynamically warping their representations of task variables—compressing task-irrelevant inputs and stretching space near decision boundaries. Crucially, this warping is not a byproduct of computation but constitutes computation itself.
Robust Sampling for Active Statistical Inference: This paper proposes a robust sampling strategy based on budget-preserving paths that optimally interpolates between uniform sampling and active sampling, ensuring the resulting estimator's variance is never worse than either baseline. This addresses the performance degradation caused by inaccurate uncertainty estimation in active statistical inference.
SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures: Using mathematical tools from o-minimal structures, this paper establishes a dichotomy for gradient flows in fully connected networks with common smooth activation functions (sigmoid, tanh, softplus, GELU, etc.): the flow either converges to a critical point or diverges to infinity with the loss converging to an asymptotic critical value. In particular, for polynomial target functions, the paper proves that the loss cannot be exactly zero but can be made arbitrarily close to zero, which necessarily causes parameter divergence.
Sample-Adaptivity Tradeoff in On-Demand Sampling: This paper systematically studies the tradeoff between sample complexity and adaptive rounds in on-demand sampling. In the realizable setting, it proves that the optimal sample complexity of $r$-round algorithms is $dk^{\Theta(1/r)}/\varepsilon$. In the agnostic setting, it proposes the LazyHedge algorithm that achieves near-optimal sample complexity in only $\widetilde{O}(\sqrt{k})$ rounds, and introduces the OODS abstract framework to establish nearly tight round complexity lower bounds.
Scalable GPU-Accelerated Euler Characteristic Curves: Optimization and Differentiable Learning for PyTorch: This paper proposes an ECC CUDA kernel optimized for modern Ampere GPUs, achieving 16–2000× speedup over prior GPU implementations, and introduces a differentiable PyTorch layer supporting end-to-end topological feature learning on dense grid images via DECT-style sigmoid relaxation.
Scalable Inference of Functional Neural Connectivity at Submillisecond Timescales: This paper generalizes the conventional discrete-time Poisson GLM to a continuous-time Poisson point process framework. Two approaches—Monte Carlo sampling and second-order polynomial approximation—are proposed to bypass the intractable integral in the likelihood. Combined with orthogonal generalized Laguerre basis functions, the method achieves minute-scale training on recordings spanning hundreds of neurons and thousands of seconds, enabling synaptic connectivity inference at submillisecond resolution.
Semi-infinite Nonconvex Constrained Min-Max Optimization: For nonconvex min-max optimization problems with infinitely many nonconvex constraints, this paper proposes the iDB-PD (Inexact Dynamic Barrier Primal-Dual) algorithm. Under the Łojasiewicz regularity condition, it establishes the first global non-asymptotic convergence guarantees: stationarity $\mathcal{O}(\epsilon^{-3})$, feasibility $\mathcal{O}(\epsilon^{-6\theta})$, and complementary slackness $\mathcal{O}(\epsilon^{-3\theta/(1-\theta)})$.
Semi-supervised Graph Anomaly Detection via Robust Homophily Learning: This paper proposes RHO (Robust Homophily Learning), which addresses the homophily diversity of normal nodes in semi-supervised graph anomaly detection via an adaptive frequency response filter (AdaFreq) and a Graph Normality Alignment (GNA) module, outperforming existing methods on 8 real-world datasets.
Sharpness-Aware Minimization with Z-Score Gradient Filtering: This paper proposes Z-Score Filtered SAM (ZSAM), which applies per-layer Z-Score statistical filtering to gradient vectors, retaining only the most statistically significant gradient components for the perturbation ascent step. This guides the optimizer toward flat minima more effectively, achieving consistent improvements in test accuracy across multiple datasets and architectures.
Sheaf Cohomology of Linear Predictive Coding Networks: This paper formalizes linear predictive coding (PC) networks as cellular sheaves, proves that PC inference is equivalent to diffusion under the sheaf Laplacian, and employs the Hodge decomposition to factorize supervisory signals into eliminable errors (removed via inference) and irreducible errors (characterized by the cohomology of cyclic topology). This framework precisely explains why certain cyclic weight initializations lead to learning stagnation.
Sign-In to the Lottery: Reparameterized Sparse Training from Scratch: This paper identifies the root cause of poor performance in pruning-at-initialization (PaI) sparse training as the inability to learn correct parameter signs as dense-to-sparse methods do. To address this, the authors propose Sign-In reparameterization ($\theta = m \odot w$), which introduces an internal degree of freedom to facilitate sign flipping. The approach is theoretically shown to resolve a class of sign-flipping scenarios complementary to those addressed by overparameterization, and empirically yields substantial improvements in sparse-from-scratch training.
SPACE: SPike-Aware Consistency Enhancement for Test-Time Adaptation in Spiking Neural Networks: This paper proposes SPACE, the first source-free single-sample test-time adaptation (TTA) method specifically designed for spiking neural networks (SNNs). By maximizing the consistency of spike-based feature maps across augmented views of a test sample, SPACE achieves robust adaptation across multiple datasets and architectures.
Stable Matching with Ties: Approximation Ratios and Learning: This paper studies two-sided matching markets with tied preferences, introduces the Optimal Stable Share (OSS) ratio to measure fairness, proves that the OSS-ratio under distributions over stable matchings is $\Omega(N)$ while under general matching distributions it is $O(\log N)$ (asymptotically tight), and extends the offline approximation results to a bandit learning setting.
Statistical Inference for Gradient Boosting Regression: This paper proposes a unified statistical inference framework for gradient boosting regression. By integrating dropout and parallel training into the Boulevard regularization scheme, the authors establish corresponding central limit theorems, enabling built-in confidence intervals, prediction intervals, and hypothesis tests for variable importance. A key finding is that increasing the dropout rate and the number of parallel trees substantially improves signal recovery—by up to $2\times$ and $4\times$, respectively.
Statistical Inference Under Performativity: This paper establishes the first complete end-to-end statistical inference framework for performative prediction, deriving a central limit theorem and data-driven covariance estimation for repeated risk minimization (RRM) algorithms, and extending prediction-powered inference (PPI) to the dynamic performative setting to obtain tighter confidence intervals.
Structure-Aware Spectral Sparsification via Uniform Edge Sampling: This paper proves that on graphs with sufficiently strong clustering structure (structure ratio $\Upsilon(k)$ large enough), uniform edge sampling suffices to preserve the spectral subspace structure required for spectral clustering, without expensive effective resistance precomputation — providing the first provable guarantee that uniform sampling preserves such structure.
The Computational Complexity of Counting Linear Regions in ReLU Neural Networks: This paper systematically identifies six mutually non-equivalent definitions of "linear regions" in ReLU networks, proves that counting linear regions is #P-hard under all six definitions (even for single-hidden-layer networks), and establishes strong inapproximability results together with polynomial-space upper bounds for deeper networks.
The Cost of Robustness: Tighter Bounds on Parameter Complexity for Robust Memorization in ReLU Nets: This paper studies the parameter complexity of robust memorization in ReLU networks, i.e., the number of parameters required to interpolate an arbitrary dataset whose differently labeled points are $\epsilon$-separated while maintaining consistent predictions within a $\mu$-neighborhood of each training sample. It establishes tighter upper and lower bounds across the full range $(0,1)$ of the robustness ratio $\rho = \mu/\epsilon$.
The Parameterized Complexity of Computing the VC-Dimension: This paper systematically investigates the parameterized complexity of computing the VC dimension, establishing that the naive exhaustive algorithm is asymptotically optimal under ETH, presenting an FPT 1-additive approximation algorithm parameterized by maximum degree, an exact $2^{O(\text{tw} \cdot \log \text{tw})} \cdot |V|$ algorithm parameterized by treewidth, and a complete characterization of the tractability landscape across all standard structural parameters.
The Persistence of Neural Collapse Despite Low-Rank Bias: This paper theoretically demonstrates that Deep Neural Collapse (DNC) is globally suboptimal in deep unconstrained feature models due to the low-rank bias induced by L2 regularization, while providing the first theoretical explanation for the persistent empirical occurrence of DNC — its solution-space dimensionality grows faster with network width than that of low-rank solutions.
The Structural Complexity of Matrix-Vector Multiplication: This paper proves that for Boolean matrices $\mathbf{M} \in \{0,1\}^{m \times n}$ with corrupted VC-dimension $d$, matrix-vector multiplication can be performed in $\widetilde{O}(nm^{1-1/d}+m)$ time. This is the first truly sub-quadratic upper bound for structured matrices, refuting the applicability of the OMv conjecture on structured inputs, and yields the first high-accuracy sub-quadratic algorithms for dynamic Laplacian solving, effective resistance, triangle detection, and related problems.
Tight Bounds On the Distortion of Randomized and Deterministic Distributed Voting: This paper studies metric distortion in the distributed voting model. For four cost objectives ($\text{avg-avg}$, $\text{avg-max}$, $\text{max-avg}$, $\text{max-max}$), it establishes improved tight or near-tight bounds under both deterministic and randomized mechanisms, providing an almost complete characterization of distortion in this model.
Training the Untrainable: Introducing Inductive Bias via Representational Alignment: This paper proposes Guidance, a method that transfers the architectural inductive bias of one network (the guide) to another otherwise "untrainable" network (the target) via layer-wise representational alignment (CKA), enabling FCNs to perform image classification and RNNs to approach Transformer-level language modeling performance.
Transfer Learning for Benign Overfitting in High-Dimensional Linear Regression: This paper proposes a two-step Transfer MNI (TM) method that enhances generalization of benign overfitting in overparameterized high-dimensional linear regression via a "preserve target signal + transfer source knowledge in the null space" mechanism. Non-asymptotic excess risk bounds are derived under both model shift and covariate shift, and a "free lunch" covariate shift regime is identified.
Ultrametric Cluster Hierarchies: I Want 'em All!: This paper proves that for any reasonable cluster hierarchy tree, one can efficiently find the optimal solution to any center-based clustering objective (e.g., k-means), and that these solutions are themselves hierarchical — thereby unlocking a large family of equally meaningful hierarchical structures from a single tree.
Uncertainty Estimation by Flexible Evidential Deep Learning: This paper proposes $\mathcal{F}$-EDL, which generalizes the Dirichlet distribution in EDL to a Flexible Dirichlet (FD) distribution for modeling class probability distributions. This approach significantly enhances the generalization of uncertainty estimation under complex scenarios such as noise, long-tail distributions, and distribution shift, while preserving the efficiency of a single forward pass.
Uncertainty Quantification for Reduced-Order Surrogate Models Applied to Cloud Microphysics: This paper proposes the first post-hoc, model-agnostic uncertainty quantification framework for latent-space reduced-order models. By applying conformal prediction to the reconstruction, latent dynamics, and end-to-end prediction components independently, it constructs distribution-free prediction intervals and reveals component-level uncertainty propagation in cloud microphysics ROMs — showing that structural errors in the autoencoder, rather than dynamics errors, dominate end-to-end prediction uncertainty.
UniFormer: Unified and Efficient Transformer for Reasoning Across General and Custom Computing: This paper proposes UniFormer, a unified and efficient Transformer architecture for cross-platform deployment on both GPUs and FPGAs. Through a dual-branch attention mechanism consisting of global linear attention and local block attention, UniFormer achieves high parallelism and compute-memory fusion.
Variational Regularized Unbalanced Optimal Transport: Single Network, Least Action: This paper proposes Var-RUOT, which incorporates the necessary optimality conditions of the Regularized Unbalanced Optimal Transport (RUOT) problem into the parameterization and loss design, enabling the solution of RUOT by learning a single scalar field. The approach yields solutions with lower action and improves training stability. The paper also analyzes the effect of growth penalty functions on biological priors.
Note 4: WebThinker — Empowering Reasoning Models with Deep Research Capabilities: WebThinker equips large reasoning models (LRMs) with autonomous web search and navigation capabilities. Through a Think-Search-Draft strategy, it seamlessly interleaves reasoning, information gathering, and report generation. After reinforcement learning optimization, it surpasses o1 and Gemini on complex reasoning and scientific report generation tasks.
Weight Weaving: Parameter Pooling for Data-Free Model Merging: This paper proposes Weight Weaving, a plug-and-play data-free model merging enhancement method that eliminates the dependency on evaluation data by pooling model parameters (e.g., via averaging or random selection) over the scaling factor search space. Across three scenarios — multi-task learning, continual learning, and domain generalization — the method achieves an average accuracy improvement of up to 15.9 percentage points.
Zebra: Towards Zero-Shot Cross-Subject Generalization for Universal Brain Visual Decoding: This paper proposes Zebra, the first zero-shot brain visual decoding framework, which disentangles fMRI representations into subject-invariant and semantic-specific components via adversarial training and residual decomposition, enabling cross-subject visual reconstruction generalization without fine-tuning on new subjects.