🔬 ICLR2026 Paper Notes

1583 ICLR2026 paper notes covering Image Generation (154), Reinforcement Learning (142), Multimodal VLM (93), Model Compression (92), Medical Imaging (72), LLM Reasoning (71), 3D Vision (65), LLM Evaluation (60), and 44 other areas. Each note covers TL;DR, motivation, method, experiments, highlights, and limitations, giving a 5-minute read of the core ideas.

🎨 Image Generation

A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers: This paper presents the first systematic analysis of conditional embeddings in diffusion Transformers, revealing extreme angular similarity (inter-class cosine similarity >99%) and dimensional sparsity (only 1–2% of dimensions carry semantic information). Pruning 2/3 of low-magnitude dimensions leaves generation quality virtually unchanged, exposing a hidden semantic bottleneck in conditional embeddings.
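The two measurements named in this note, inter-class cosine similarity and magnitude-based pruning, can be illustrated on toy data. The following is a minimal sketch, assuming hand-made 9-dimensional vectors; the helper names `cosine` and `prune_low_magnitude` and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def prune_low_magnitude(vec, keep_count):
    """Zero every entry except the `keep_count` largest-magnitude ones."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:keep_count])
    return [x if i in keep else 0.0 for i, x in enumerate(vec)]

# Toy "class embeddings" (assumption): a large shared component plus a
# small class-specific shift, padded with near-zero dimensions.
class_a = [10.5, -8.0, 6.0] + [0.01] * 6
class_b = [10.0, -8.5, 6.0] + [0.01] * 6

sim_full = cosine(class_a, class_b)

# Drop 2/3 of the dimensions (keep 3 of 9) by magnitude.
pruned_a = prune_low_magnitude(class_a, keep_count=3)
pruned_b = prune_low_magnitude(class_b, keep_count=3)
sim_pruned = cosine(pruned_a, pruned_b)
```

In this toy setting the two class vectors are nearly parallel, and removing the low-magnitude dimensions barely changes their similarity. It only demonstrates how the quantities are computed, not the paper's empirical findings.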
AlignTok: Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models: This paper proposes AlignTok, which aligns pretrained visual foundation encoders (e.g., DINOv2) into continuous tokenizers for diffusion models. Through a three-stage alignment strategy—semantic latent space establishment → perceptual detail supplementation → decoder refinement—AlignTok constructs a semantically rich latent space, achieving gFID 1.90 on ImageNet 256×256 in only 64 epochs, converging faster and yielding better generation quality than VAEs trained from scratch.
Amortising Inference and Meta-Learning Priors in Neural Networks (BNNP): This paper proposes BNNP (Bayesian Neural Network Process), a neural process that treats BNN weights as latent variables and the BNN itself as the decoder. Through layer-wise amortised variational inference, BNNP jointly learns BNN priors and inference networks across multiple datasets. It is the first work to empirically answer "Does a good prior eliminate the need for a good approximate inference method?"—the answer is no; there is no free lunch.
Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation: AsynDM assigns different timestep schedules to different pixels—denoising prompt-relevant regions more slowly—so that these regions can leverage cleaner contextual references, thereby significantly improving semantic alignment in text-to-image generation without requiring any fine-tuning.
Autoregressive Image Generation with Randomized Parallel Decoding: This paper proposes ARPG, a visual autoregressive model built upon a "guided decoding" framework that decouples positional guidance (query) from content representation (key-value), enabling fully randomized-order training and generation with efficient parallel decoding. On ImageNet-1K 256×256, ARPG achieves 1.94 FID in 64 steps with over 20× throughput improvement and over 75% memory reduction.
Beyond Confidence: The Rhythms of Reasoning in Generative Models: This paper proposes the Token Constraint Bound ($\delta_{\text{TCB}}$) metric, which quantifies the largest perturbation to an LLM's hidden state that preserves the next-token prediction, measuring local prediction robustness and revealing instabilities that traditional perplexity fails to capture.
Blueprint-Bench: Comparing Spatial Intelligence of LLMs, Agents and Image Models: Blueprint-Bench evaluates AI spatial reasoning through the task of "generating 2D floorplans from apartment interior photographs": the inputs (photos) are fully within the training distribution, while the task (spatial reconstruction) is out-of-distribution. The benchmark evaluates LLMs including GPT-5, Claude 4 Opus, Gemini 2.5 Pro, and Grok-4; image generation models including GPT-Image and NanoBanana; and agent systems including Codex CLI and Claude Code. Results show that the vast majority of models perform at or below a random baseline, revealing a systematic blind spot in current AI spatial intelligence.
Branched Schrödinger Bridge Matching: This paper proposes BranchSBM, a framework that extends Schrödinger Bridge Matching to branching scenarios by parameterizing multiple time-dependent velocity fields and growth processes. It models bifurcating dynamics from a single initial distribution to multiple target distributions, significantly outperforming single-branch methods on tasks such as LiDAR surface navigation and single-cell perturbation modeling.
Bridging Degradation Discrimination and Generation for Universal Image Restoration: BDG performs fine-grained degradation discrimination via multi-angle multi-scale gray-level co-occurrence matrices (MAS-GLCM), and designs a three-stage diffusion training pipeline (generation → bridging → restoration) to seamlessly integrate degradation discrimination with generative priors, achieving significant fidelity improvements on all-in-one restoration and real-world super-resolution tasks.
Bridging Generalization Gap of Heterogeneous Federated Clients Using Generative Models: FedVTC proposes that, in model-heterogeneous federated learning, each client generates synthetic data via a Variational Transposed Convolution network (VTC) from aggregated feature distribution statistics to fine-tune the local model. Without requiring a public dataset, the method significantly improves generalization while reducing communication and memory overhead.
CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models: This paper proposes Consistency Mid-Training (CMT), which inserts a lightweight intermediate training stage between a pretrained diffusion model and flow map post-training. By training the model to map arbitrary points on ODE trajectories back to clean samples, CMT yields trajectory-aligned initialization, reducing training cost by up to 98% while achieving state-of-the-art two-step generation quality.
Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition: This paper proposes General Policy Composition (GPC), which at test time convexly combines the distribution scores of multiple pretrained diffusion/flow policies without additional training, yielding a composite policy that surpasses any individual parent policy. Theoretical analysis proves that convex combination improves single-step score error, and this improvement propagates to the full sampling trajectory via a Grönwall bound.
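A convex combination of scores can be sketched with analytic one-dimensional Gaussian scores standing in for the pretrained policies. This is a toy illustration under that assumption: in the paper the scores come from neural diffusion/flow policies, and the names `gaussian_score` and `composed_score` are hypothetical.

```python
def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of N(mean, std^2) at x."""
    return -(x - mean) / (std * std)

def composed_score(x, weight, score_a, score_b):
    """Convex combination of two score functions, with weight in [0, 1]."""
    assert 0.0 <= weight <= 1.0
    return weight * score_a(x) + (1.0 - weight) * score_b(x)

# Two stand-in "parent policies" (assumption): unit-variance Gaussians.
policy_a = lambda x: gaussian_score(x, mean=0.0, std=1.0)
policy_b = lambda x: gaussian_score(x, mean=2.0, std=1.0)

weight = 0.25
# For equal variances the composed score is itself a Gaussian score whose
# mean is the same convex combination of the parent means.
blended_mean = weight * 0.0 + (1.0 - weight) * 2.0
score_at_blend = composed_score(blended_mean, weight, policy_a, policy_b)
```

The composed score vanishes at the blended mean, so a sampler driven by it would concentrate between the two parents, with the weight controlling how close it sits to each one.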
Compositional amortized inference for large-scale hierarchical Bayesian models: This paper extends compositional score matching (CSM) to hierarchical Bayesian models, addressing numerical instability under large numbers of data groups via a novel error-damping estimator and mini-batch strategy. It achieves, for the first time, amortized inference over hierarchical models exceeding 750,000 parameters (250,000+ data groups), validated on a real-world fluorescence lifetime imaging application.
Concept-TRAK: Understanding how diffusion models learn concepts through concept-level attribution: This paper proposes Concept-TRAK, which extends influence functions from image-level to concept-level attribution by designing concept-specific training losses (DPS reward) and utility losses (CFG guidance). The method substantially outperforms TRAK, D-TRAK, and DAS on synthetic, CelebA-HQ, and AbC benchmarks, with particularly significant advantages in OOD settings where novel concept combinations are evaluated.
Condition Errors Refinement in Autoregressive Image Generation with Diffusion Loss: This paper theoretically analyzes the advantage of autoregressive diffusion loss models over conditional diffusion models in correcting condition errors (exponential decay of gradient norms), and proposes a condition refinement method based on optimal transport (Wasserstein Gradient Flow) to address the "condition inconsistency" problem in autoregressive generation, achieving FID 1.31 on ImageNet (based on MAR).
Conditionally Whitened Generative Models for Probabilistic Time Series Forecasting: This paper proposes CW-Gen (Conditionally Whitened Generative Models), which replaces the standard Gaussian terminal distribution in diffusion models and flow matching by jointly estimating the conditional mean and a sliding-window covariance matrix. The authors provide theoretical guarantees showing that sampling quality is necessarily improved when the estimator satisfies sufficient conditions, and demonstrate consistent improvements in multivariate time series probabilistic forecasting across 5 datasets × 6 generative models.
Conjuring Semantic Similarity: This paper proposes a vision-imagination-based measure of textual semantic similarity by computing the Jeffreys divergence between the reverse SDEs induced by a text-conditioned diffusion model under two text prompts. The metric is directly computable via Monte-Carlo sampling and, for the first time, quantifies the alignment between the semantic space learned by diffusion models and human annotations.
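For reference, the Jeffreys divergence is the symmetrized KL divergence, J(P, Q) = KL(P ‖ Q) + KL(Q ‖ P). The sketch below evaluates it for two small discrete distributions; this is only an assumption-laden illustration of the definition, since the paper applies it to reverse SDEs and estimates it by Monte-Carlo sampling. The function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jeffreys_divergence(p, q):
    """Symmetrized KL divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Toy distributions (assumption), standing in for two prompts' outputs.
p = [0.5, 0.5]
q = [0.9, 0.1]

j_pq = jeffreys_divergence(p, q)
j_qp = jeffreys_divergence(q, p)
j_pp = jeffreys_divergence(p, p)
```

Unlike plain KL, the result does not depend on the argument order, which makes it usable as a similarity measure between two prompts.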
Consistent Text-to-Image Generation via Scene De-Contextualization: This paper identifies the root cause of identity (ID) shift in T2I models as scene contextualization — the injection of contextual information from scene tokens into ID tokens — and proposes a training-free method, Scene De-Contextualization (SDeC), that leverages SVD singular value directional stability analysis to identify and suppress latent scene–ID associations in prompt embeddings, enabling per-scene identity-consistent generation.
Contact-Guided 3D Genome Structure Generation of E. coli via Diffusion Transformers: This paper proposes DiffBacChrom — a conditional diffusion Transformer (CrossDiT) that generates ensembles of 3D genome conformations for E. coli from Hi-C contact maps. The method employs a ResNet VAE to maintain bin-aligned latent encodings, a Transformer encoder with cross-attention for Hi-C conditioning, and flow-matching training. The generated ensembles exhibit high agreement with input Hi-C data in terms of distance-decay curves $P(s)$ and SCC metrics, while preserving conformational diversity.
Contact Wasserstein Geodesics for Non-Conservative Schrödinger Bridges: This paper proposes the Non-Conservative Generalized Schrödinger Bridge (NCGSB)—built on contact Hamiltonian mechanics to allow time-varying energy—and introduces the Contact Wasserstein Geodesic (CWG), which reformulates the bridge problem as geodesic computation on a finite-dimensional Jacobi metric. A ResNet parameterization achieves near-linear complexity and supports guided generation. CWG substantially outperforms iterative SB solvers on manifold navigation, molecular dynamics, and image generation tasks.
ContextBench: Modifying Contexts for Targeted Latent Activation: This paper introduces ContextBench, a benchmark comprising 715 tasks for evaluating methods that automatically generate fluent inputs capable of activating specific latent features, and proposes two EPO-enhanced variants—LLM-assisted and diffusion-model inpainting—that Pareto-dominate standard EPO in the trade-off between activation strength and linguistic fluency.
Continual Unlearning for Text-to-Image Diffusion Models: A Regularization Perspective: This paper presents the first systematic study of continual unlearning for text-to-image (T2I) diffusion models. It identifies that existing unlearning methods suffer from "utility collapse" under sequential unlearning requests due to cumulative parameter drift, and proposes a suite of plug-in regularization strategies (L1/L2 norm, selective fine-tuning, model merging) along with a semantics-aware gradient projection method to mitigate this issue.
Contractive Diffusion Policies: Robust Action Diffusion via Contractive Score-Based Sampling with Differential Equations: This paper proposes Contractive Diffusion Policies (CDPs), which introduce contraction regularization into the diffusion sampling ODE to suppress the accumulation of score matching errors and solver errors. With minimal modification and a single hyperparameter $\gamma$, CDPs improve the robustness of diffusion-based policies in offline learning settings.
COSMO-INR: Complex Sinusoidal Modulation for Implicit Neural Representations: Through harmonic distortion analysis and Chebyshev polynomial approximation, this paper rigorously proves that odd/even symmetric activation functions exhibit systematic attenuation in post-activation spectra. It proposes modulating activation functions with complex sinusoidal terms $e^{j\zeta x}$ to preserve full spectral support, and introduces the COSMO-RC activation function alongside a regularized prior embedder architecture. The method achieves an average PSNR gain of +5.67 dB over the strongest baseline on Kodak image reconstruction and +3.45 dB on NeRF.
CREPE: Controlling Diffusion with Replica Exchange: This paper proposes CREPE, an inference-time control method for diffusion models based on Replica Exchange (Parallel Tempering), serving as the computational dual of SMC — it operates in parallel across denoising steps while generating samples serially. CREPE offers high sample diversity, supports online refinement, and handles a variety of tasks including temperature annealing, reward tilting, model composition, and CFG debiasing.
DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment: This paper addresses the sparse reward problem in Flow Matching + GRPO alignment by estimating step-wise reward gains as dense rewards via ODE denoising rollouts of intermediate latents, and adaptively calibrating per-timestep noise injection in the SDE sampler based on dense rewards to regulate exploration. The method outperforms Flow-GRPO on three tasks: human preference alignment, compositional generation, and text rendering.
Detecting and Mitigating Memorization in Diffusion Models through Anisotropy of the Log-Probability: This paper demonstrates that norm-based memorization detection metrics are valid only under isotropic log-probability distributions and fail in low-noise anisotropic regimes. A denoising-free detection metric is proposed that combines high-noise norms with low-noise angular alignment (cosine similarity), surpassing existing denoising-free methods on SD v1.4/v2.0 while being over 5× faster.
DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation: This paper proposes DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation, comprising InkVAE (which learns a structured latent space via dual regularization from OCR and style classification) and InkDiT (which performs conditional denoising generation in the latent space). DiffInk substantially outperforms the state of the art on Chinese handwriting generation (AR 94.38% vs. 91.48%) while achieving an 800× speedup.
Diffusion Alignment as Variational Expectation-Maximization: This paper formalizes diffusion model alignment as a variational EM algorithm: the E-step employs test-time search (soft-Q-guided sampling with importance sampling) to explore multimodal, high-reward trajectories, while the M-step distills the search results into model parameters via forward-KL minimization. The approach simultaneously achieves high reward and high diversity on both image generation and DNA sequence design tasks.
Diffusion Blend: Inference-Time Multi-Preference Alignment for Diffusion Models: This paper proposes Diffusion Blend, which achieves multi-preference alignment at inference time by blending the backward diffusion processes of multiple reward-finetuned models. DB-MPA supports arbitrary linear combinations of rewards; DB-KLA enables dynamic KL regularization control; DB-MPA-LS eliminates inference overhead via stochastic LoRA sampling. The paper theoretically derives error bounds for the blending approximation and empirically approaches the MORL oracle upper bound.
Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function: This paper proposes SQDF (Soft Q-based Diffusion Finetuning), which fine-tunes diffusion models under a KL-regularized RL framework via a training-free differentiable soft Q-function approximation and reparameterized policy gradients. Three complementary components—a discount factor, a consistency model, and an off-policy replay buffer—collectively optimize the target reward while effectively mitigating reward over-optimization, preserving sample naturalness and diversity.
DiffusionNFT: Online Diffusion Reinforcement with Forward Process: DiffusionNFT proposes a fundamentally new online RL paradigm for diffusion models: rather than performing policy optimization over the reverse sampling process (as in GRPO), it performs contrastive training on positive and negative samples via a flow matching objective over the forward process, defining an implicit policy improvement direction. The method is 3–25× faster than FlowGRPO and requires no CFG.
Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild: This paper proposes DrPose, which applies direct reward fine-tuning to maximize PoseScore—a metric measuring skeletal consistency between multi-view latent images and ground-truth 3D poses—combined with KL regularization to prevent reward hacking. Together with the DrPose15K dataset (15K diverse poses sampled from the Motion-X dataset and animated via the MIMO video generator), DrPose significantly improves 3D human reconstruction quality under challenging poses such as dynamic movements and acrobatics.
Directional Textual Inversion for Personalized Text-to-Image Generation: This paper identifies a norm inflation problem in token embeddings learned by Textual Inversion (TI), which degrades text alignment under complex prompts. The proposed Directional Textual Inversion (DTI) fixes the embedding norm at an in-distribution scale and optimizes only the direction on the unit hypersphere via Riemannian SGD, regularized by a von Mises-Fisher prior, substantially improving prompt faithfulness.
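The idea of fixing the norm and optimizing only the direction can be sketched as Riemannian SGD on the unit sphere: project the Euclidean gradient onto the tangent space, take a step, and renormalize. The example below is a minimal sketch under stated assumptions: the loss is a toy alignment objective, the von Mises-Fisher prior is omitted, and `FIXED_NORM`, `target`, and the function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def riemannian_sgd_step(direction, euclid_grad, lr):
    """One SGD step on the unit sphere.

    The gradient is projected onto the tangent space at `direction`
    (its radial component is removed), then the updated point is
    retracted to the sphere by renormalizing.
    """
    radial = sum(g * d for g, d in zip(euclid_grad, direction))
    tangent = [g - radial * d for g, d in zip(euclid_grad, direction)]
    moved = [d - lr * t for d, t in zip(direction, tangent)]
    return normalize(moved)

FIXED_NORM = 0.4                      # hypothetical in-distribution norm
target = normalize([1.0, 2.0, 2.0])   # toy optimum direction (assumption)
direction = normalize([1.0, 0.0, 0.0])

for _ in range(200):
    # Toy loss: -<direction, target>, whose Euclidean gradient is -target.
    grad = [-t for t in target]
    direction = riemannian_sgd_step(direction, grad, lr=0.1)

# The embedding norm never changes; only the direction was learned.
embedding = [FIXED_NORM * d for d in direction]
```

Because every update ends with a renormalization, the embedding norm stays at the chosen scale no matter how long training runs, which is the property that avoids norm inflation.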
Discrete Adjoint Matching: This paper proposes Discrete Adjoint Matching (DAM), which derives adjoint variables for discrete state spaces from a purely statistical perspective (rather than from control theory), extending the continuous-domain Adjoint Matching framework to discrete generative models based on continuous-time Markov chains (CTMCs). The approach enables effective fine-tuning of diffusion-based LLMs (LLaDA-8B), improving accuracy on Sudoku from 11.5% to 89.2%.
DistillKac: Few-Step Image Generation via Damped Wave Equations: This paper replaces the Fokker-Planck equation with the damped wave equation (telegrapher's equation) and its stochastic Kac representation as the probabilistic flow foundation for generative models, enabling finite-speed propagation. An endpoint distillation method is proposed for few-step generation, achieving FID=4.14 in 4 steps and FID=5.66 in 1 step on CIFAR-10.
Diverse Text-to-Image Generation via Contrastive Noise Optimization: This paper proposes Contrastive Noise Optimization (CNO), which applies an InfoNCE contrastive loss over the Tweedie denoised prediction space to optimize initial noise vectors as a preprocessing step, improving the generation diversity of diffusion models while maintaining fidelity, without modifying the sampling process or model parameters.
Does FLUX Already Know How to Perform Physically Plausible Image Composition?: This paper proposes SHINE, a training-free image composition framework that leverages the intrinsic physical priors of pretrained T2I models (e.g., FLUX) via three components — Manifold-Steered Anchor Loss, Degradation-Suppression Guidance, and Adaptive Background Blending — to achieve high-quality object insertion under complex lighting conditions (shadows, water reflections, etc.).
Does Semantic Noise Initialization Transfer from Images to Videos? A Paired Diagnostic Study: Through rigorous prompt-level paired statistical testing, this work finds that transferring semantic noise initialization (golden noise) from the image domain to video diffusion models yields a marginally positive but statistically insignificant trend on temporal metrics (p≈0.17). Noise-space diagnostics reveal that insufficient directional stability and spatiotemporal frequency structure discrepancies are the root causes.
DoFlow: Flow-based Generative Models for Interventional and Counterfactual Forecasting: This paper proposes DoFlow, a causal generative model based on continuous normalizing flows (CNF) that unifies observational, interventional, and counterfactual time series forecasting over a causal DAG. The model additionally supports anomaly detection via explicit likelihood estimation, and is validated on both synthetic and real-world medical datasets.
DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing: DragFlow is the first framework to incorporate the strong generative priors of FLUX (DiT) into drag editing. By replacing conventional point-level supervision with region-level affine supervision, combined with gradient-mask hard constraints and adapter-enhanced inversion, the method substantially improves drag editing quality.
Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing: This paper identifies a role imbalance in existing unified multimodal models, where the understanding module merely acts as a translator while the generation module is forced to simultaneously serve as both "designer" and "painter." By constructing the DIM dataset (14M long-context text-image pairs + 233K CoT editing blueprints), design responsibilities are transferred to the understanding module. The resulting 4.6B-parameter model surpasses models five times its size.
Dual-Solver: A Generalized ODE Solver for Diffusion Models with Dual Prediction: This paper proposes Dual-Solver, which generalizes multi-step samplers for diffusion models via three sets of learnable parameters — prediction type interpolation $\gamma$, integration domain selection $\tau$, and residual adjustment $\kappa$ — and learns these parameters using the classification loss of frozen pretrained classifiers (MobileNet/CLIP) without requiring teacher trajectories. The method consistently outperforms DPM-Solver++ and related approaches in the low-NFE regime (3–9 NFE).
Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?: This paper proposes T2I-CoReBench, the first comprehensive benchmark that systematically evaluates both compositional ability (Composition) and reasoning ability (Reasoning) of T2I models. It covers 12 evaluation dimensions, 1,080 high-difficulty prompts, and approximately 13,500 checklist questions. Large-scale evaluation of 38 models reveals that reasoning capability lags far behind compositional capability, constituting the primary bottleneck in current T2I generation.
EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing: This paper constructs EditReward-Data, a high-quality dataset of 200K expert-annotated preference pairs, and trains the EditReward reward model, which achieves state-of-the-art human alignment across multiple image editing evaluation benchmarks. The model is further validated as a data filter that substantially improves downstream editing model performance.
EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling: This paper proposes the first systematic "benchmark evaluation → reward modeling → reinforcement learning training" pipeline for image editing: constructing the EditReward-Bench benchmark, training the EditScore reward model series (7B–72B, surpassing GPT-5), and successfully applying it to Online RL training to significantly improve editing model performance.
Efficient Adversarial Attacks on High-dimensional Offline Bandits: This paper exposes a security vulnerability in offline multi-armed bandit (MAB) evaluation frameworks: an attacker can completely hijack a bandit's decision-making behavior by applying imperceptibly small perturbations to publicly available reward model weights. The required perturbation norm decreases as input dimensionality increases ($\widetilde{\mathcal{O}}(d^{-1/2})$), rendering image-based generative model evaluation pipelines particularly vulnerable.
Eliminating VAE for Fast and High-Resolution Generative Detail Restoration: By replacing the VAE encoder and decoder with ×8 pixel-(un)shuffle operations, this work converts latent-space diffusion super-resolution (GenDR) into pixel-space super-resolution (GenDR-Pix). Combined with multi-stage adversarial distillation and a PadCFG inference strategy, the method achieves 2.8× speedup and 60% memory reduction with negligible visual degradation, enabling 4K image restoration within 1 second using only 6 GB of VRAM for the first time.
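Pixel-unshuffle (space-to-depth) and its inverse are lossless rearrangements, which is why they can stand in for a learned encoder/decoder pair. The following is a minimal single-channel sketch on nested lists, assuming a factor of 2 for readability (the note describes a factor of 8); the function names and the channel ordering are illustrative choices rather than the paper's code.

```python
def pixel_unshuffle(image, r):
    """Rearrange an H x W image into r*r channels of size (H/r) x (W/r).

    Channel i*r + j holds the pixels at row offset i and column offset j
    within each r x r block.
    """
    h, w = len(image), len(image[0])
    assert h % r == 0 and w % r == 0
    return [
        [[image[y * r + i][x * r + j] for x in range(w // r)]
         for y in range(h // r)]
        for i in range(r)
        for j in range(r)
    ]

def pixel_shuffle(channels, r):
    """Inverse of pixel_unshuffle: r*r channels back to one H x W image."""
    hh, ww = len(channels[0]), len(channels[0][0])
    return [
        [channels[(y % r) * r + (x % r)][y // r][x // r]
         for x in range(ww * r)]
        for y in range(hh * r)
    ]

# 4 x 4 toy image with pixel values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
latent = pixel_unshuffle(image, r=2)      # 4 channels of 2 x 2
restored = pixel_shuffle(latent, r=2)     # back to 4 x 4
```

The round trip reproduces the input exactly, so no information is lost when moving between pixel space and the lower-resolution, higher-channel representation.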
Embracing Discrete Search: A Reasonable Approach to Causal Structure Learning: This paper proposes FLOP (Fast Learning of Order and Parents), a score-based causal discovery algorithm for linear models. By introducing fast parent selection and iterative Cholesky-based score updates, FLOP substantially reduces runtime, rendering Iterated Local Search (ILS) practical. It achieves near-perfect graph recovery on standard causal discovery benchmarks, reestablishing discrete search as a principled and competitive approach in causal discovery.
Error as Signal: Stiffness-Aware Diffusion Sampling via Embedded Runge-Kutta Guidance: This paper proposes ERK-Guid, which leverages the order-difference error of embedded Runge-Kutta solvers as a guidance signal to adaptively correct local truncation error (LTE) in stiff regions, improving diffusion model sampling quality without additional network evaluations.
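The order-difference error of an embedded Runge-Kutta pair can be shown with the simplest such pair, Heun-Euler, where the first-order and second-order solutions share the same function evaluations. This sketch only illustrates the error estimate on a scalar ODE; it is not the paper's guidance rule, and the test equations and names are assumptions.

```python
def heun_euler_step(f, t, y, h):
    """One step of the embedded Heun-Euler pair.

    Returns the second-order solution and the absolute difference between
    the second-order and first-order solutions, which serves as a local
    error estimate at no extra evaluation of f.
    """
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1)
    y_low = y + h * k1                  # Euler, order 1
    y_high = y + 0.5 * h * (k1 + k2)    # Heun, order 2
    return y_high, abs(y_high - y_low)

# A smooth ODE and a stiffer one (toy examples, not diffusion ODEs).
smooth = lambda t, y: -1.0 * y
stiff = lambda t, y: -50.0 * y

_, err_smooth = heun_euler_step(smooth, 0.0, 1.0, 0.1)
_, err_stiff = heun_euler_step(stiff, 0.0, 1.0, 0.1)
```

With the same step size, the estimate is far larger for the stiff equation, which is the kind of signal that can flag regions where the solver is struggling.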
Event-T2M: Event-level Conditioning for Complex Text-to-Motion Synthesis: This paper proposes the Event-T2M framework, which decomposes text prompts into event-level atomic actions and injects them into a Conformer-based diffusion model via a TMR encoder and an Event-level Cross-Attention (ECA) module, significantly improving generation quality and semantic alignment for complex multi-event motion synthesis.
Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models: This paper proposes SpatialGenEval, a benchmark comprising 1,230 long, information-dense prompts spanning 10 spatial sub-domains, for systematically evaluating the spatial intelligence of 23 state-of-the-art T2I models. The benchmark reveals that spatial reasoning is the primary bottleneck. The authors additionally construct the SpatialT2I dataset to enable data-centric improvement of spatial intelligence.
Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model: This paper proposes ECAD (Evolutionary Caching to Accelerate Diffusion models), which employs a genetic algorithm to automatically search for optimal caching schedules along the speed–quality Pareto frontier. Without modifying model parameters and using only 100 calibration prompts, ECAD achieves 2–3× inference speedup while maintaining or even improving generation quality.
Exposing Hidden Biases in Text-to-Image Models via Automated Prompt Search: This paper proposes Bias-Guided Prompt Search (BGPS), which combines LLM decoding guidance with a diffusion model intermediate-layer attribute classifier to automatically discover interpretable text prompts that maximally expose hidden social biases in T2I models—revealing residual biases even in debiased models.
Factuality Matters: When Image Generation and Editing Meet Structured Visuals: This is the first systematic study of the generation and editing of structured images (charts, mathematical figures, diagrams, tables, etc.). It contributes a 1.3M-pair code-aligned training dataset with CoT reasoning annotations, a unified VLM+diffusion model architecture, and the StructBench benchmark with 1,700+ samples. The work reveals that reasoning capability is the key bottleneck for current models in handling structured visual content.
SSCP: Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning: This paper proposes the Single-Step Completion Policy (SSCP), which compresses multi-step generative policies into single-step inference by predicting a "completion vector" (the normalized direction from any intermediate state to the target action) within a flow matching framework. On D4RL, SSCP matches multi-step diffusion/flow policies while achieving 64× faster training and 4.7× faster inference, and extends to GCRL to flatten hierarchical policies.
Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation: This paper proposes Flow2GAN, a two-stage training framework that first employs an improved Flow Matching objective to learn generative capabilities, then fine-tunes with a GAN to achieve few-step (1/2/4 steps) high-fidelity audio generation, combined with a multi-resolution network architecture that processes Fourier coefficients at different time-frequency resolutions.
Flow Matching with Injected Noise for Offline-to-Online Reinforcement Learning: By injecting controllable noise into flow matching training to broaden policy coverage, and combining an entropy-guided sampling mechanism to dynamically balance exploration and exploitation during online fine-tuning, FINO significantly improves sample efficiency in offline-to-online RL under limited interaction budgets.
FlowCast: Advancing Precipitation Nowcasting with Conditional Flow Matching: This work is the first to apply Conditional Flow Matching (CFM) as an end-to-end probabilistic generative model for precipitation nowcasting. By learning a direct noise-to-data mapping in a compressed latent space, the proposed method surpasses diffusion-based models in both predictive accuracy and probabilistic performance with significantly fewer sampling steps.
FlowCast: Trajectory Forecasting for Scalable Zero-Cost Speculative Flow Matching: FlowCast is a framework that introduces speculative decoding into Flow Matching models. It exploits the local smoothness of the velocity field to extrapolate future states using the current velocity prediction as a zero-cost draft, then selectively skips redundant steps via MSE-based verification, achieving >2.5× speedup without quality degradation.
Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control: This paper proposes Follow-Your-Shape, a training-free and mask-free shape-aware editing framework. It constructs a Trajectory Divergence Map (TDM) by computing token-level velocity discrepancies between inversion and editing trajectories to precisely localize editing regions, and employs a staged KV injection strategy to achieve large-scale shape transformations while strictly preserving the background.
Free Lunch for Stabilizing Rectified Flow Inversion: This paper proposes PMI (Proximal-Mean Inversion) and mimic-CFG, two training-free methods that stabilize Rectified Flow inversion by applying proximal gradient correction toward the historical mean of the velocity field. On PIE-Bench, both methods achieve state-of-the-art reconstruction and editing quality with fewer NFEs.
From Parameters to Behaviors: Unsupervised Compression of the Policy Space: Based on the manifold hypothesis, this paper proposes unsupervised compression of the policy space: an autoencoder is trained with a behavioral reconstruction loss (rather than a parameter reconstruction loss) to compress the high-dimensional policy parameter space $\Theta \subseteq \mathbb{R}^P$ into a low-dimensional latent behavioral space $\mathcal{Z} \subseteq \mathbb{R}^k$ (up to a 121,801:1 compression ratio). Experiments on Mountain Car, Reacher, Hopper, and HalfCheetah demonstrate that the intrinsic dimensionality of the behavioral manifold is determined by environment complexity rather than network size, and that PGPE optimization in the latent space converges faster than PPO, SAC, and other SOTA baselines on 7 out of 8 tasks.
From Prediction to Perfection: Introducing Refinement to Autoregressive Image Generation: This paper proposes TensorAR, which upgrades standard AR image generation from next-token prediction to next-tensor prediction: each step predicts an overlapping tensor (a group of consecutive tokens), and subsequent tensors overlap with preceding ones to enable iterative refinement. A discrete diffusion noise mechanism is introduced to address training information leakage. TensorAR serves as a plug-and-play module compatible with AR models such as LlamaGen, Open-MAGVIT2, and Janus-Pro, consistently improving generation quality on both class-to-image and text-to-image tasks.
GenCP: Towards Generative Modeling Paradigm of Coupled Physics: This paper proposes GenCP, which reformulates coupled multiphysics simulation as a probability density evolution problem. It leverages flow matching to learn conditional velocity fields from decoupled data, and synthesizes coupled solutions at inference time via Lie-Trotter operator splitting—realizing "decoupled training, coupled inference" with theoretically bounded error guarantees.
GenDR: Lighten Generative Detail Restoration: This paper proposes GenDR, a lightweight single-step diffusion super-resolution model for generative detail restoration. It first identifies a fundamental divergence between T2I and SR objectives: T2I favors multi-step inference with a 4-channel latent space, whereas SR favors fewer steps with a 16-channel latent space. It then builds a customized SD2.1-VAE16 foundation model (0.9B) that extends the latent space via REPA representation alignment without increasing model size, and introduces CiD/CiDA consistency score identity distillation, which integrates SR-specific priors into score distillation together with adversarial learning and representation alignment. The resulting pipeline contains only a UNet and a VAE, runs inference in 77ms, and surpasses existing SOTA on all quality and efficiency metrics.
Generalization of Diffusion Models Arises with a Balanced Representation Space: By analyzing the optimal solutions of two-layer nonlinear ReLU denoising autoencoders (DAEs), this paper provides a unified characterization of both memorization and generalization in diffusion models and introduces a representation-centric perspective on generalization. The theoretical findings are consistently validated on EDM, DiT, and Stable Diffusion v1.4, and give rise to two practical applications: memorization detection and controllable editing.
Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training: This paper proposes the Generate Any Scene data engine, which systematically enumerates scene graphs from a visual element taxonomy comprising 28K objects × 1.5K attributes × 10K relations, and converts them into caption–VQA pairs. The engine supports four applications: self-improvement (SD1.5 +4%), targeted distillation (<800 samples, TIFA +10%), a scene-graph reward model (DPG-Bench +5% vs. CLIP), and content moderation enhancement.
Generating Directed Graphs with Dual Attention and Asymmetric Encoding: This paper proposes Directo, the first directed graph generation model based on Discrete Flow Matching (DFM), which captures directional dependencies of directed edges via a direction-aware dual attention mechanism and asymmetric positional encoding, while establishing a standardized evaluation benchmark for directed graph generation.
GeoDiv: Framework for Measuring Geographical Diversity in Text-to-Image Models: This paper proposes GeoDiv, a framework that leverages the world knowledge embedded in LLMs and VLMs to systematically evaluate the geographical diversity of T2I models along two dimensions — the Socioeconomic Visual Index (SEVI) and the Visual Diversity Index (VDI) — revealing systematic impoverishment biases in model outputs for countries such as India and Nigeria.
GGBall: Graph Generative Model on Poincaré Ball: This paper proposes GGBall, the first graph generation framework operating entirely on the Poincaré ball model. By combining a hyperbolic vector-quantized variational autoencoder (HVQVAE) with a Riemannian flow matching prior, GGBall achieves state-of-the-art performance on both hierarchical and molecular graph generation, reducing the average generation error by 18% on hierarchical graph benchmarks.
GLASS Flows: Efficient Inference for Reward Alignment of Flow and Diffusion Models: This paper proposes GLASS (Gaussian Latent Sufficient Statistic) Flows — a novel "flow within a flow" sampling paradigm that recasts the stochastic Markov transition $p_{t'|t}(x_{t'} | x_t)$ as an internal ODE problem via Gaussian sufficient statistic reparameterization, reusing the pretrained denoiser without retraining. This enables Feynman-Kac Steering without sacrificing ODE efficiency or SDE stochasticity, consistently surpassing the Best-of-N ODE baseline on the FLUX text-to-image model and achieving a new state of the art in inference-time reward alignment.
Hierarchical Entity-centric Reinforcement Learning with Factored Subgoal Diffusion: This paper proposes HECRL, a hierarchical entity-centric offline goal-conditioned RL framework that combines a value-based GCRL agent with a factored subgoal diffusion model, improving success rates by more than 150% on multi-entity long-horizon tasks.
HierLoc: Hyperbolic Entity Embeddings for Hierarchical Visual Geolocation: HierLoc reformulates visual geolocation as an image-to-entity alignment problem in hyperbolic space, replacing 5M+ image embeddings with ~240K geographic entity embeddings. It achieves a 19.5% reduction in mean geodesic error and a 43% improvement in sub-region accuracy on OSV5M.
HOG-Diff: Higher-Order Guided Diffusion for Graph Generation: This paper proposes HOG-Diff, a graph diffusion framework that leverages higher-order topological structures (e.g., rings, triangles, motifs) as generative guidance. By extracting higher-order skeletons via Cell Complex Filtration (CCF) and combining them with a generalized OU diffusion bridge, the framework realizes coarse-to-fine progressive graph generation, achieving state-of-the-art performance on 8 benchmarks for both molecular and general graph generation.
Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning: Recall proposes the first multi-modal guided attack framework that optimizes adversarial image prompts in the latent space using a single reference image. Combined with the original text prompt, it exploits the image-conditioning channel of diffusion models and achieves an average attack success rate (ASR) of 65%–97% across 10 SOTA unlearning methods, substantially outperforming text-only attack methods and exposing the vulnerability of current unlearning mechanisms to image-modality attacks.
Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment (CODA): This paper proposes CODA, a framework that addresses slot entanglement and weak alignment in diffusion-based object-centric learning by introducing register slots to absorb residual attention, fine-tuning cross-attention projections, and applying a contrastive alignment loss. CODA achieves substantial improvements in object discovery and compositional generation quality on both synthetic and real-world datasets.
Improving Discrete Diffusion Unmasking Policies Beyond Explicit Reference Policies (UPO): This paper proposes Unmasking Policy Optimization (UPO), which formalizes the denoising process of Masked Diffusion Models (MDMs) as a KL-regularized Markov Decision Process and trains a lightweight unmasking policy model via reinforcement learning to replace heuristic schedulers such as max-confidence. Both theoretical analysis and empirical results demonstrate that the learned policy generates samples closer to the true data distribution.
Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models: This paper presents the first systematic comparison of Visual Autoregressive (VAR) models and diffusion models on compositional text-image alignment. Evaluating 6 T2I models across T2I-CompBench++ and GenEval benchmarks, it finds that Infinity-8B achieves state-of-the-art performance on nearly all compositional dimensions, demonstrating a clear architectural advantage of VAR models in compositional generation.
Intention-Conditioned Flow Occupancy Models: This paper proposes InFOM, which leverages flow matching to construct an intention-conditioned occupancy model. By applying variational inference to infer latent intentions from unannotated data, InFOM enables RL pre-training without labeled datasets, achieving a 1.8× improvement in median return and a 36% gain in success rate across 36 state-based tasks and 4 visual tasks.
JointDiff: Bridging Continuous and Discrete in Multi-Agent Trajectory Generation: This paper proposes JointDiff, a joint continuous-discrete diffusion framework that, for the first time, unifies Gaussian diffusion (for trajectories) and multinomial diffusion (for ball-possession events) in a single model. It further introduces a CrossGuid module to support weak possession guidance and text-guided semantic controllable generation, achieving state-of-the-art performance on multi-agent trajectory generation in sports scenarios.
Laplacian Multi-scale Flow Matching for Generative Modeling: This paper proposes LapFlow, which decomposes images into Laplacian pyramid residuals and models different scales in parallel via a Mixture-of-Transformers (MoT) architecture with causal attention, reducing computational cost while improving generation quality.
Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency: This paper proposes rCM (score-regularized continuous-time consistency model), which for the first time scales continuous-time consistency distillation to 14B-parameter text-to-image/video models. By combining forward KL divergence (consistency) with reverse KL divergence (score distillation), rCM matches DMD2 in quality while preserving diversity, achieving 15–50× inference speedup.
Latent Diffusion Model without Variational Autoencoder: This paper proposes SVG, which replaces the VAE latent space with frozen DINOv3 self-supervised features for diffusion model training. A lightweight residual encoder supplements fine-grained details, enabling faster training, more efficient inference, and a unified visual representation applicable across tasks.
Learning a Distance Measure from the Information-Estimation Geometry of Data: This paper proposes the Information-Estimation Metric (IEM), a novel distance function induced by the geometry of the data probability density. IEM measures the distance between signals by comparing their score vector fields at multiple noise levels. Without any supervised training, IEM achieves perceptual judgment prediction performance competitive with fully supervised methods.
LLM2Fx-Tools: Tool Calling for Music Post-Production: This paper proposes LLM2Fx-Tools, the first framework that applies LLM tool calling to audio effect modules. It leverages a multimodal LLM to understand audio inputs, employs CoT reasoning to select effect types, determine processing order, and estimate parameters, enabling interpretable and controllable music post-production.
Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation: This paper proposes Locality-aware Parallel Decoding (LPD), which combines a flexible parallelized autoregressive modeling architecture with a locality-aware generation order schedule. LPD reduces the number of generation steps for 256×256 images from 256 to 20 and achieves at least 3.4× lower latency.
Localized Concept Erasure in Text-to-Image Diffusion Models via High-Level Representation Misdirection: HiRM introduces a concept erasure strategy that decouples the update location from the erasure target — updating only the first-layer weights of the CLIP text encoder while imposing erasure supervision on the high-level semantic representations at the final layer. By steering target concept representations toward random directions (HiRM-R) or semantically meaningful directions (HiRM-S), the method achieves effective erasure of styles, objects, and NSFW content on the UnlearnCanvas and NSFW benchmarks, with zero-shot transferability to the Flux architecture.
Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall: This paper identifies the "sampling wall" in discrete diffusion models—whereby rich categorical distribution information collapses into one-hot vectors after sampling—and proposes a Loopholing mechanism that introduces a deterministic latent pathway to propagate distribution information across steps. The approach reduces generation perplexity by up to 61%, substantially closing the gap with autoregressive models.
LVTINO: LAtent Video consisTency INverse sOlver for High Definition Video Restoration: This paper proposes LVTINO, the first zero-shot video inverse problem solver built upon a Video Consistency Model (VCM) prior. By injecting measurement consistency constraints—without requiring automatic differentiation—into the VCM sampling process, LVTINO achieves perceptual quality and temporal consistency surpassing frame-wise image methods across multiple video inverse problems (super-resolution, deblurring, inpainting) with a minimal number of neural function evaluations (NFEs).
MAC-AMP: A Closed-Loop Multi-Agent Collaboration System for Multi-Objective Antimicrobial Peptide Design: This paper proposes MAC-AMP, the first closed-loop multi-agent collaboration system that reformulates antimicrobial peptide (AMP) design as a coordinated multi-agent optimization problem, achieving multi-objective optimization through AI-simulated peer review and adaptive reward design.
Market Games for Generative Models: Equilibria, Welfare, and Strategic Entry: This paper formalizes a three-tier model–platform–user market game, analyzes the existence conditions of pure-strategy Nash equilibria under generative model competition, characterizes market structure and social welfare implications, and designs optimal entry strategies for model providers.
Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter: This paper proposes Mod-Adapter, a tuning-free multi-concept personalization method that predicts concept-specific modulation directions in the modulation space of DiT, enabling decoupled customization of both object and abstract concepts (pose, lighting, material, etc.), substantially outperforming existing methods on multi-concept personalization benchmarks.
Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs: This paper repositions "model collapse"—commonly regarded as a negative phenomenon—as a tool for machine unlearning, proposing the PMC method. By iteratively fine-tuning on retained data and the model's own generated outputs, PMC achieves targeted information removal without directly optimizing over the forget targets, and validates its effectiveness through both theoretical analysis and empirical experiments.
MOLM: Mixture of LoRA Markers: This paper proposes MOLM, a watermarking framework that reinterprets LoRA adapters as watermark carriers. A binary key-driven routing mechanism embeds verifiable and robust watermarks into a frozen generative model without per-key retraining.
Monocular Normal Estimation via Shading Sequence Estimation: This paper proposes RoSE, which reformulates monocular normal estimation as a shading sequence estimation problem. An image-to-video (I2V) generative model is used to predict shading sequences under multiple illuminations, and a simple ordinary least squares solver then converts the shading sequence into a normal map. RoSE achieves state-of-the-art performance on real-world benchmark datasets.
Motion Prior Distillation in Time Reversal Sampling for Generative Inbetweening: This paper proposes Motion Prior Distillation (MPD), an inference-time distillation method that distills motion residuals from the forward path into the backward path, fundamentally resolving the bidirectional motion prior conflict in time reversal sampling. MPD enables more coherent generative inbetweening without any additional training.
Multi-agent Coordination via Flow Matching: This paper proposes MAC-Flow, which first learns a centralized joint behavior distribution via Flow Matching, then distills it into decentralized single-step policies through IGM (Individual-Global-Max) decomposition combined with Q-value maximization for behavior-regularized training. Evaluated across 4 benchmarks, 12 environments, and 34 datasets, MAC-Flow achieves approximately 14.5× inference speedup over diffusion-based methods while maintaining coordination performance comparable to diffusion policies.
MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion: This paper introduces a new task termed multi-view customization and proposes the MVCustom framework, which leverages a video diffusion backbone with dense spatio-temporal attention for holistic frame consistency. At inference time, it introduces two novel techniques: depth-aware feature rendering and consistency-aware latent completion. Together these make MVCustom the first method to simultaneously achieve camera pose control, subject identity preservation, and cross-view geometric consistency.
Neon: Negative Extrapolation From Self-Training Improves Image Generation: Neon is a post-processing method that requires <1% additional training compute: the model is first fine-tuned on its own synthetic data (which degrades it), and the weights are then negatively extrapolated away from the degraded weights. The paper proves that mode-seeking samplers cause anti-alignment between synthetic and real data gradients, so negative extrapolation is equivalent to optimizing toward the real data distribution. On ImageNet 256×256, applying Neon to xAR-L achieves a SOTA FID of 1.02.
NeuralOS: Towards Simulating Operating Systems via Neural Generative Models: This paper proposes NeuralOS, a dual-component architecture combining an RNN-based state tracker and a diffusion renderer, which directly predicts graphical interface frame sequences from user input events (mouse movements, clicks, and keyboard input). It is presented as the first simulation of an operating system via neural generative models.
Next Visual Granularity Generation: This paper proposes the Next Visual Granularity (NVG) generation framework, which decomposes images into structured sequences at different granularity levels and generates from global layout to fine-grained details progressively, achieving consistent FID improvements over the VAR family.
No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings: This paper proposes MoFit, the first membership inference attack (MIA) framework for diffusion models under a caption-free setting. By constructing surrogate images and conditional embeddings that overfit to the target model, MoFit exploits the asymmetric sensitivity of member samples to conditioning mismatch to enable effective inference.
Offline Reinforcement Learning with Generative Trajectory Policies: This paper proposes Generative Trajectory Policies (GTP), which adopts a unified perspective treating diffusion models, flow matching, and consistency models as special cases of ODE solution mappings. GTP learns a complete continuous-time trajectory solution mapping and introduces two adaptation techniques—score approximation and advantage weighting—achieving state-of-the-art performance on the D4RL benchmark.
Pareto-Conditioned Diffusion Models for Offline Multi-Objective Optimization: This paper proposes Pareto-Conditioned Diffusion (PCD), which reformulates offline multi-objective optimization as a conditional sampling problem. PCD directly generates high-quality solutions conditioned on objective trade-offs without requiring explicit surrogate models, achieving the best overall consistency across diverse benchmarks.
PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models: This paper proposes PCPO, which corrects the disproportionate credit assignment inherent in policy gradient methods for diffusion/flow models via a stabilized objective reformulation and principled timestep reweighting, significantly accelerating convergence and mitigating model collapse.
PI-Light: Physics-Inspired Diffusion for Full-Image Relighting: This paper proposes π-Light (PI-Light), a two-stage full-image relighting framework. Stage 1 performs intrinsic property decomposition (albedo, normals, roughness, etc.) via a physics-guided diffusion model; Stage 2 synthesizes the relit image under target illumination via a physics-guided neural rendering module. Batch-aware attention and physics-inspired losses are introduced to achieve strong generalization to real-world scenes.
PolyGraph Discrepancy: a classifier-based metric for graph generation: This paper proposes PolyGraph Discrepancy (PGD), which approximates a variational lower bound on the Jensen-Shannon distance by training a classifier to distinguish real graphs from generated ones. PGD addresses three fundamental shortcomings of MMD-based metrics: the lack of an absolute scale, incomparability across descriptors, and high bias and variance under small sample sizes.
Pseudo-Nonlinear Data Augmentation: A Constrained Energy Minimization Viewpoint: Leveraging the dually flat structure of energy-based models and information geometry, this work proposes a training-free, efficient, and controllable data augmentation method that performs cross-modal augmentation on statistical manifolds via forward projection (encoding) and backward projection (decoding).
Purrception: Variational Flow Matching for Vector-Quantized Image Generation: This paper proposes Purrception, an image generation method that adapts Variational Flow Matching (VFM) to vector-quantized (VQ) latent spaces. By simultaneously computing a velocity field in the continuous embedding space and learning a categorical posterior distribution over codebook indices, Purrception bridges continuous transport dynamics with discrete supervision, achieving faster training convergence and FID scores competitive with state-of-the-art methods on ImageNet-1k 256×256.
Pyramidal Patchification Flow for Visual Generation: This paper proposes Pyramidal Patchification Flow (PPFlow), which employs larger patches at high-noise timesteps and smaller patches at low-noise timesteps, achieving 1.6–2.0× denoising speedup while preserving generation quality, without requiring any re-noising tricks.
QVGen: Pushing the Limit of Quantized Video Generative Models: This paper proposes QVGen, a quantization-aware training (QAT) framework for video diffusion models. It introduces auxiliary modules to reduce gradient norms and improve convergence, and designs a rank decay strategy to progressively eliminate the inference overhead of auxiliary modules during training. QVGen is the first method to achieve near full-precision video generation quality under 4-bit quantization.
RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation: This paper proposes RefAny3D, a 3D asset-referenced image generation framework that achieves precise geometric and texture consistency between generated images and 3D reference assets through a dual-branch generation strategy that jointly models RGB images and point maps.
Referring Layer Decomposition: This paper introduces the Referring Layer Decomposition (RLD) task, which predicts complete RGBA layers from a single RGB image given flexible user-provided prompts (spatial, textual, or hybrid). It also constructs the RefLade dataset comprising 1.1 million samples and proposes an automated evaluation protocol.
RIDER: 3D RNA Inverse Design with Reinforcement Learning-Guided Diffusion: This paper proposes RIDER, the first framework to incorporate reinforcement learning into 3D RNA inverse design. It first pretrains a conditional diffusion model (RIDE) to learn sequence–structure relationships, then applies RL fine-tuning to directly optimize 3D structural similarity rather than native sequence recovery rate, achieving over 100% improvement across all 3D self-consistency metrics.
RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation: This paper proposes RMFlow, which appends a noise-injection refinement step after 1-NFE MeanFlow transport to compensate for single-step transport errors, while incorporating a maximum likelihood objective during training to minimize the KL divergence between the learned and target distributions. RMFlow achieves near-SOTA 1-NFE results on text-to-image generation, molecular generation, and time-series generation.
RNE: plug-and-play diffusion inference-time control and energy-based training: This paper proposes the Radon-Nikodym Estimator (RNE), which exploits density ratios between path distributions to reveal the fundamental relationship between marginal densities and transition kernels, providing a unified plug-and-play framework that simultaneously enables diffusion density estimation, inference-time control, and energy-based diffusion training.
Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance: This paper proposes ProMoE, an MoE framework for Diffusion Transformers that introduces a two-stage router (conditional routing + prototype routing) and a routing contrastive loss to provide explicit semantic guidance, promoting expert specialization and significantly outperforming existing MoE and dense models on ImageNet.
SafeFlowMatcher: Safe and Fast Planning using Flow Matching with Control Barrier Functions: This paper proposes SafeFlowMatcher, a safe planning framework that integrates flow matching with Control Barrier Functions (CBF). Through a predictor-corrector (PC) integrator, it decouples trajectory generation from safety certification, providing formal safety guarantees while preserving the efficiency of flow matching.
Sample-Efficient Evidence Estimation of Score-Based Priors for Model Selection: This paper proposes DiME, a model evidence estimator that integrates along the temporal marginals of the diffusion posterior. DiME requires neither prior scores nor density evaluations, and accurately estimates model evidence under diffusion model priors using as few as 20 posterior samples, enabling prior selection and model validation.
scDFM: Distributional Flow Matching for Robust Single-Cell Perturbation Prediction: This paper proposes scDFM, a generative framework based on conditional flow matching (CFM) that enforces distribution-level fidelity via MMD regularization and introduces the PAD-Transformer backbone to handle noisy and sparse single-cell data. On combinatorial perturbation prediction, scDFM reduces MSE by 19.6% over the strongest baseline CellFlow.
Seek-CAD: A Self-Refined Generative Modeling for 3D Parametric CAD Using Local Inference via DeepSeek: This paper proposes Seek-CAD, the first training-free CAD parametric model generation framework based on a locally deployed reasoning LLM (DeepSeek-R1). It achieves self-refinement through the synergy of step-wise visual feedback and Chain-of-Thought (CoT), and introduces a novel SSR triplet design paradigm to support complex CAD model generation.
Self-Improving Loops for Visual Robotic Planning: This paper proposes SILVR, a framework that achieves continual self-improvement on unseen tasks by iteratively fine-tuning an in-domain video generation model on self-collected online trajectories. SILVR achieves up to 285% performance improvement on MetaWorld and real-robot benchmarks.
SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP: SeMoBridge is proposed as a lightweight semantic modality bridge that maps image embeddings into the text modality, converting unreliable intra-modal (image-to-image) comparisons into reliable inter-modal (text-to-image) comparisons, achieving state-of-the-art few-shot classification performance with minimal training overhead.
SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation: This paper proposes SenseFlow, which scales distribution matching distillation (DMD) to large-scale flow-based text-to-image models (SD 3.5 Large 8B / FLUX.1 dev 12B) via Implicit Distribution Alignment (IDA) and Intra-Segment Guidance (ISG), enabling high-quality 4-step image generation.
SERUM: Simple, Efficient, Robust, and Unifying Marking for Diffusion-based Image Generation: SERUM is a watermarking method that injects unique watermark noise into the initial noise of diffusion models and trains a lightweight detector to identify watermarks directly from generated images — without costly DDIM inversion — achieving state-of-the-art detection rates under diverse attacks with extremely fast injection and detection, while supporting multi-user scenarios.
SMOTE and Mirrors: Exposing Privacy Leakage from Synthetic Minority Oversampling: This paper presents the first systematic study of privacy leakage in SMOTE, proposing two attacks—DistinSMOTE and ReconSMOTE—that demonstrate SMOTE is fundamentally non-privacy-preserving and disproportionately exposes minority-class records.
SoFlow: Solution Flow Models for One-Step Generative Modeling: This paper proposes Solution Flow Models (SoFlow), which directly learn the solution function $f(x_t, t, s)$ of the velocity ODE (mapping $x_t$ at time $t$ to the solution at time $s$). Trained from scratch via a Flow Matching loss combined with a JVP-free solution consistency loss, SoFlow achieves a 1-NFE FID of 2.96 on ImageNet 256 (XL/2), outperforming MeanFlow (3.43).
SongEcho: Towards Cover Song Generation via Instance-Adaptive Element-wise Linear Modulation: This paper proposes the SongEcho framework, which achieves cover song generation via Instance-Adaptive Element-wise Linear Modulation (IA-EiLM), generating new vocals and accompaniment while preserving the melodic contour of the original song.
SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models: SPEED proposes a closed-form model editing method based on null space constraints, refining the preservation set through three complementary techniques—Influence-based Prior Filtering (IPF), Directional Prior Augmentation (DPA), and Invariance Equality Constraint (IEC)—to achieve scalable (erasing 100 concepts within 5 seconds), precise (zero semantic loss on non-target concepts), and efficient concept erasure.
SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation: This paper proposes Scaled Spatial Guidance (SSG), a training-free inference-time guidance method that enhances the coarse-to-fine hierarchical generation quality of visual autoregressive models through frequency-domain prior construction and semantic residual amplification.
Steer Away From Mode Collisions: Improving Composition In Diffusion Models: To address missing and colliding concepts in multi-concept prompts for diffusion models, this paper proposes the "mode collision" hypothesis, namely that the modes of the joint distribution overlap with those of individual concept distributions, and introduces CO3 (Concept Contrasting Corrector). CO3 corrects sampling via a contrasting distribution $\tilde{p}(x|C) \propto p(x|C) / \prod_i p(x|c_i)$ in Tweedie mean space to steer away from degenerate modes, improving compositional generation in a plug-and-play, gradient-free, and model-agnostic manner.
Step-Aware Residual-Guided Diffusion for EEG Spatial Super-Resolution: This paper proposes SRGDiff, a step-aware residual-guided diffusion model that reformulates EEG spatial super-resolution as a dynamic conditional generation task, achieving high-fidelity reconstruction via per-step residual direction correction and timestep-dependent affine modulation.
Stochastic Self-Guidance for Training-Free Enhancement of Diffusion Models: This paper proposes S²-Guidance, which constructs a weak model by randomly dropping transformer block activations during denoising to perform self-guidance, correcting the suboptimal predictions of CFG without additional training. The method consistently outperforms CFG and other advanced guidance strategies on text-to-image and text-to-video tasks.
TAVAE: A VAE with Adaptable Priors Explains Contextual Modulation in the Visual Cortex: This paper extends the VAE formalism to propose the Task-Amortized VAE (TAVAE), which explains contextual modulation in the primary visual cortex (V1) by learning task-specific priors over an already-learned representation. The framework accounts for bimodal population responses observed when test stimuli deviate from training stimuli in an orientation discrimination task.
Temporal Concept Dynamics in Diffusion Models via Prompt-Conditioned Interventions: This paper proposes the PCI (Prompt-Conditioned Intervention) framework, which quantifies when concepts become committed during diffusion model denoising by switching text prompts at different timesteps along the denoising trajectory, and applies these findings to temporally-aware image editing.
Test-Time Iterative Error Correction for Efficient Diffusion Models: This paper proposes IEC (Iterative Error Correction), a plug-and-play test-time method that iteratively corrects inference errors in efficient diffusion models, reducing error accumulation from exponential to linear growth.
The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models: This paper systematically investigates how prompt complexity affects three key dimensions of synthetic data generated by T2I models—quality, diversity, and consistency—proposes a new evaluation framework, and finds that prompt expansion, as an inference-time intervention, optimally balances diversity and aesthetic quality.
The Spacetime of Diffusion Models: An Information Geometry Perspective: This paper proposes a "spacetime" framework for diffusion models from an information-geometric perspective. It proves that the standard pullback geometry degenerates to straight lines in diffusion models, and introduces instead a spacetime geometry based on the Fisher-Rao metric, from which practically computable diffusion edit distances (DiffED) and transition path sampling methods are derived.
There and Back Again: On the Relation between Noise and Image Inversions in Diffusion Models: This paper analyzes the error mechanisms in DDIM inversion and shows that latent encodings exhibit low diversity and high correlation in smooth image regions (e.g., sky). It traces this phenomenon to inaccurate noise predictions in the early inversion steps and proposes a simple fix that replaces the first few inversion steps with forward diffusion.
Towards Interpretable Visual Decoding with Attention to Brain Representations: This paper proposes NeuroAdapter, which segments fMRI signals into independent tokens by brain region and conditions Stable Diffusion directly via cross-attention, bypassing conventional CLIP/DINO intermediate embedding spaces. On NSD and other datasets, NeuroAdapter matches or surpasses existing methods on high-level semantic metrics. It further introduces the IBBI bidirectional interpretability framework, which for the first time dynamically reveals how different cortical regions drive image generation along the denoising trajectory.
Training-Free Reward-Guided Image Editing via Trajectory Optimal Control: This paper reformulates reward-guided image editing as a trajectory optimal control problem, treating the reverse process of diffusion/flow models as a controllable trajectory. By iteratively optimizing the entire trajectory via Pontryagin's Maximum Principle (PMP) with adjoint state methods, it achieves effective reward-guided editing without training and without reward hacking.
Translate Policy to Language: Flow Matching Generated Rewards for LLM Explanations: This paper proposes a general framework that leverages Rectified Flow to generate distributional rewards for training explanation-generating LLMs. By employing continuous normalizing flows (CNF) to capture the pluralistic and probabilistic nature of human judgments on explanations, the framework provides theoretical guarantees that CNF can effectively recover the true human reward distribution. It significantly outperforms RLHF/RLAIF baselines on SMAC, MMLU, MathQA, and other tasks.
TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows: TwinFlow extends the flow matching time interval from $[0,1]$ to $[-1,1]$ and constructs twin trajectories that form self-adversarial signals, enabling one-step generation without any discriminator or frozen teacher. This is the first work to scale 1-NFE generation to a 20B-parameter model (Qwen-Image), achieving a 1-NFE GenEval score of 0.86, close to the original 100-NFE score of 0.87, while reducing inference cost by 100×.
Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models: Uni-X proposes an X-shaped architecture with separated ends and a shared middle to mitigate gradient conflicts between visual and textual modalities in Unified Multimodal Models (UMMs). By designating shallow and deep layers as modality-specific and sharing intermediate layers, a 3B-parameter model matches or surpasses 7B AR-UMMs on both image generation and multimodal understanding.
Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow: This paper proposes DualFlow, the first unified framework for dyadic interactive/reactive 3D motion generation under text+music multi-modal conditions, built on Rectified Flow and Retrieval-Augmented Generation (RAG). It introduces contrastive flow matching and a synchronization loss, achieving a 2.5% FID improvement and a 76% R-precision improvement on the MDD dataset, with 2.5× faster inference.
Unsupervised Conformal Inference: Bootstrapping and Alignment to Control LLM Uncertainty: This paper proposes an unsupervised conformal inference framework (BB-UCP) that achieves distribution-free finite-sample coverage guarantees for LLM generation under label-free, API-compatible conditions, via Gram matrix interaction energy scoring, batch bootstrap calibration, and conformal alignment, effectively detecting and filtering hallucinated outputs.
Verification of the Implicit World Model in a Generative Model via Adversarial Sequences: This paper proposes an adversarial sequence generation method to verify the soundness of implicit world models in generative sequence models. Through systematic evaluation in the chess domain using multiple adversarial strategies (IMO/BSO/AD), it finds that all tested models are unsound, while training objectives and dataset choice significantly affect soundness. Furthermore, linear board-state probes exhibit no causal role in most models.
Verifier-Constrained Flow Expansion for Discovery Beyond the Data: This paper proposes Flow Expander (FE), which expands the coverage of pretrained flow models in probability space via verifier-constrained entropy maximization, enabling the generation of design samples beyond the training data distribution while maintaining validity. FE increases diversity in molecular conformation design while preserving chemical validity.
VFScale: Intrinsic Reasoning through Verifier-Free Test-time Scalable Diffusion Model: VFScale proposes a test-time scalable diffusion model that requires no external verifier. By introducing an MRNCL loss and KL regularization to improve the energy landscape, the model's intrinsic energy function serves as a verifier. Combined with hybrid MCTS denoising for efficient search, a model trained on 6×6 mazes achieves 88% success on 15×15 mazes, where standard diffusion models fail entirely.
Visual Autoregressive Modeling for Instruction-Guided Image Editing: This paper proposes VAREdit, which reformulates instruction-guided image editing as a multi-scale prediction problem. By introducing a Scale-Aligned Reference module to address the scale mismatch in finest-scale conditioning, VAREdit significantly outperforms diffusion-based methods in both editing fidelity and inference efficiency.
When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models: This paper is the first to reveal and systematically study "backdoor modality collapse" in multimodal diffusion models — a phenomenon where the backdoor effect degenerates to rely on a single modality (typically text) during multimodal backdoor attacks. Two novel Shapley-value-based metrics, TMA and CTI, are proposed to quantify modality contribution and cross-modal interaction, uncovering a "winner-takes-all" dynamic and negative interaction.
When Scores Learn Geometry: Rate Separations under the Manifold Hypothesis: Under the manifold hypothesis, this paper reveals a scale separation between geometric and distributional information in score learning: manifold geometry contributes at order $\Theta(\sigma^{-2})$, which dominates distributional information by a factor of $O(\sigma^{-2})$. This establishes that the success of diffusion models stems primarily from learning the data manifold rather than the full distribution, and that a one-line code modification suffices to sample from the uniform distribution on the manifold.
Zatom-1: A Multimodal Flow Foundation Model for 3D Molecules and Materials: Zatom-1 is the first end-to-end fully open-source foundation model that unifies generative modeling and property prediction for 3D molecules and materials via multimodal flow matching. Using a standard Transformer architecture, it directly models discrete atom types and continuous 3D geometry in Euclidean space, achieving positive transfer learning across chemical domains.
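
As a rough illustration of the CO3 entry in the list above (Steer Away From Mode Collisions), the contrasting distribution can be read in score form. This is a sketch derived only from the formula quoted in that note, not from the paper itself; any per-concept weighting and the exact mapping to Tweedie means are not specified here and are left out. Taking logarithms and then gradients of $\tilde{p}(x|C) \propto p(x|C) / \prod_i p(x|c_i)$, the normalizing constant drops out:

$$
\nabla_x \log \tilde{p}(x|C) = \nabla_x \log p(x|C) - \sum_i \nabla_x \log p(x|c_i)
$$

Under this reading, the joint-prompt score is pushed away from directions that any single concept $c_i$ favors on its own, which is consistent with the stated goal of steering away from modes shared with the individual concept distributions.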

🎮 Reinforcement Learning¶

A Unifying View of Coverage in Linear Off-Policy Evaluation: This paper proposes a novel coverage parameter—feature-dynamics coverage—and conducts a new finite-sample analysis of the classical LSTDQ algorithm through an instrumental variable lens, unifying the various fragmented coverage definitions in linear off-policy evaluation.
AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking: This paper proposes AbstRaL, which uses reinforcement learning to teach LLMs to construct mathematical abstractions of reasoning problems (replacing concrete numbers/names with symbolic variables and extracting general formulas), then employs a symbolic solver to derive answers. AbstRaL nearly eliminates performance degradation caused by distribution shift on GSM perturbation benchmarks, and also yields implicit improvements on OOD mathematical and general reasoning tasks.
AMPED: Adaptive Multi-objective Projection for balancing Exploration and skill Diversification: This paper proposes AMPED, a framework that applies gradient surgery (PCGrad) during skill pretraining to balance gradient conflicts between exploration (entropy + RND) and skill diversity (AnInfoNCE), and employs a SAC-based skill selector during fine-tuning to adaptively choose the optimal skill. AMPED outperforms SBRL baselines including DIAYN, CeSD, and CIC on Maze and URLB benchmarks.
APPLE: Toward General Active Perception via Reinforcement Learning: This paper proposes APPLE, a general active perception framework that combines reinforcement learning with supervised learning. Active perception is formulated as a POMDP, with the reward defined as the RL reward minus the prediction loss. The gradient naturally decomposes into a policy gradient term and a prediction loss gradient term. Built upon off-policy algorithms (SAC/CrossQ) and a shared ViViT backbone, the framework is validated across 5 diverse task benchmarks. The CrossQ variant requires no per-task hyperparameter tuning and achieves a 53% improvement in training efficiency.
ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning: This paper proposes ARM-FM, a framework that leverages foundation models (e.g., GPT-4o) to automatically generate Language-Aligned Reward Machines (LARMs) from natural language task descriptions — encompassing the automaton structure, executable label functions, and per-state natural language descriptions — providing RL agents with compositional dense reward signals. The framework successfully solves sparse-reward long-horizon tasks that standard RL completely fails to learn, across environments including MiniGrid, Craftium (3D Minecraft), and Meta-World, while achieving zero-shot task generalization.
AutoQD: Automatic Discovery of Diverse Behaviors with Quality-Diversity Optimization: AutoQD embeds policy occupancy measures into a finite-dimensional space via random Fourier features (RFF), then applies weighted PCA for dimensionality reduction to obtain behavior descriptors (BDs), enabling QD optimization without manually designed BDs. It outperforms hand-crafted BDs and existing unsupervised QD methods across 6 continuous control tasks.
AutoQD: Automatic Discovery of Diverse Behaviors with Quality-Diversity Optimization: This paper proposes AutoQD, which automatically generates behavior descriptors by embedding policy occupancy measures via random Fourier features, enabling the discovery of diverse, high-quality policies in continuous control tasks without manual descriptor design. Effectiveness is demonstrated across 6 standard environments.
AutoTool: Automatic Scaling of Tool-Use Capabilities in RL via Decoupled Entropy Constraints: This paper proposes a reinforcement learning framework with Decoupled Adaptive Entropy Constraints, enabling LLMs to automatically switch between long and short reasoning modes based on problem difficulty in tool-calling tasks, achieving a 9.8% accuracy improvement while reducing inference token overhead by approximately 81%.
AutoTool: Automatic Scaling of Tool-Use Capabilities in RL via Decoupled Entropy Constraints: This paper proposes AutoTool, which addresses reasoning collapse in direct RL training for LLM tool use and the overthinking problem in scaled models via a decoupled adaptive entropy constraint strategy. AutoTool enables automatic switching between long and short reasoning modes based on problem difficulty, achieving a 9.8% accuracy improvement while reducing reasoning token overhead by ~81%.
AWM: Accurate Weight-Matrix Fingerprint for Large Language Models: AWM is a training-free LLM weight-matrix fingerprinting method that recovers permutation and sign-flip transformations in the embedding layer via the Linear Assignment Problem (LAP), and then applies unbiased CKA to neutralize orthogonal transformations in Q/K matrices. It achieves perfect AUC (1.0) on 150 LLM pairs, is robust to six categories of post-training (SFT, continued pretraining up to 5.5T tokens, RL, multimodal extension, pruning, and upcycling), and completes within 30 seconds.
BA-MCTS: Bayes Adaptive Monte Carlo Tree Search for Offline Model-based RL: This work is the first to introduce Bayes Adaptive MDPs (BAMDPs) into offline model-based RL. It proposes Continuous BAMCP to handle Bayesian planning in continuous state/action spaces, combines pessimistic reward penalization with search-based policy iteration (an "RL + Search" paradigm), achieves significant improvements over 19 baselines on 12 D4RL tasks (Cohen's $d > 1.8$), and demonstrates successful application to tokamak fusion control.
Boolean Satisfiability via Imitation Learning: This paper proposes ImitSAT, the first imitation learning-based branching heuristic for CDCL solvers. By compressing solver runs into conflict-free KeyTrace expert sequences and framing branching decisions as an autoregressive prediction task conditioned on the decision prefix, ImitSAT significantly reduces propagation counts and solving time under a small query budget, and demonstrates strong generalization to structured SAT benchmarks.
Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?: Through an observational study (18 open-source RPT models) and an interventional study (single-domain GRPO training), this paper systematically reveals the generalization limitations of Reinforcement Post-Training (RPT/RLVR): RPT yields substantial within-domain gains, but cross-domain generalization is inconsistent — structured domains (math ↔ code) exhibit mutual transfer, whereas gains do not generalize to unstructured domains (law/finance/medicine). This finding holds consistently across algorithms, model scales, and training steps.
Chain-of-Context Learning: Dynamic Constraint Understanding for Multi-Task VRPs: This paper proposes Chain-of-Context Learning (CCL), which achieves stepwise dynamic constraint-aware decoding via Relevance-Guided Context Reformulation (RGCR, adaptively aggregating constraint information to construct context) and Trajectory-Shared Node Re-embedding (TSNR, sharing node updates across trajectories to avoid redundant computation). CCL comprehensively outperforms existing methods across 48 VRP variants (16 in-distribution + 32 out-of-distribution).
Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models: Co-rewarding proposes a self-supervised RL framework that addresses training collapse in self-rewarding RL through two complementary supervision perspectives: a data-side mechanism (cross-view consistency via contrastive paraphrased questions) and a model-side mechanism (EMA teacher model providing pseudo-labels). Without any human annotations, the framework matches or surpasses RLVR (with ground-truth labels) across multiple mathematical reasoning benchmarks.
Continuous-Time Value Iteration for Multi-Agent Reinforcement Learning: This paper proposes VIP (Value Iteration via PINN), the first framework to apply Physics-Informed Neural Networks (PINNs) for solving HJB PDEs in continuous-time multi-agent reinforcement learning. A Value Gradient Iteration (VGI) module is introduced to iteratively refine value gradients. VIP consistently outperforms both discrete-time and continuous-time baselines on continuous-time MPE and MuJoCo multi-agent tasks.
Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning: CalibRL reframes expert data as a distribution calibration baseline (rather than a strict imitation target), and achieves fine-grained control over the exploration–exploitation trade-off in MLLM reasoning training via asymmetric LeakyReLU activation combined with advantage weighting. This addresses entropy collapse in RLVR and substantially outperforms GRPO/DAPO on tasks such as geometric reasoning.
Cross-Embodiment Offline Reinforcement Learning for Heterogeneous Robot Datasets: This paper systematically investigates cross-embodiment offline RL pretraining, identifies gradient conflicts leading to negative transfer under increasing suboptimal data ratios and robot diversity, and proposes Embodiment Grouping (EG)—a strategy that clusters robots by morphological graph distance and updates the actor group-wise. On a locomotion benchmark spanning 16 robot platforms, EG substantially mitigates negative transfer (IQL+EG improves over IQL by 34% on the 70% suboptimal dataset).
CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning: This paper presents CUDA-L1, a three-stage pipeline framework based on Contrastive Reinforcement Learning (Contrastive RL), which trains an LLM with initially weak CUDA capabilities into an effective CUDA optimizer. The framework achieves an average 3.12× speedup across 250 CUDA kernels on KernelBench, with a peak speedup of 120×, and generalizes across GPU architectures.
Deep SPI: Safe Policy Improvement via World Models: This paper establishes a theoretical framework for Safe Policy Improvement (SPI) that unifies world models and representation learning with policy update guarantees: an importance-ratio-based neighborhood operator constrains policy updates to ensure monotonic improvement and convergence; local transition/reward losses control world model quality and representation stability. The proposed DeepSPI algorithm matches or surpasses PPO and DeepMDP on the ALE-57 benchmark.
Distributionally Robust Cooperative Multi-Agent Reinforcement Learning via Robust Value Factorization: This paper proposes the Distributionally Robust IGM (DrIGM) principle, integrating distributionally robust optimization into the value factorization framework of cooperative multi-agent RL, enabling classical methods such as VDN, QMIX, and QTRAN to maintain robust decentralized execution performance under distribution shift between training and deployment environments.
DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition: This paper proposes DiVE-k, a framework that constructs multiple-choice questions (MCQs) from the top-k outputs of a large vision-language model (LVLM) and trains the model via GRPO reinforcement learning to perform differential visual reasoning, achieving substantial improvements in base-to-novel generalization for fine-grained image recognition.
Divide, Harmonize, Then Conquer It: Shooting Multi-Commodity Flow Problems with Multimodal Language Models: This paper proposes Pram, the first framework to leverage multimodal language models (MLMs) for solving multi-commodity flow (MCF) problems. It decomposes the original problem into subproblems via partitioning and employs multi-agent reinforcement learning (MARL) to coordinate global consistency across subproblems. Theoretical convergence to the optimal solution is proven, and empirical results show that Pram is 1–2 orders of magnitude faster than LP solvers while achieving near-optimal performance.
Don't Just Fine-tune the Agent, Tune the Environment: This paper proposes the Environment Tuning training paradigm, which enables LLM agents to learn complex multi-turn tool use from scratch using only 400 training samples, through structured curriculum learning, actionable environment-augmented feedback, and fine-grained progress rewards, while achieving strong out-of-distribution generalization.
Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts: This work is the first to simultaneously address train-time robustness (source–target domain dynamics mismatch) and test-time robustness (deployment-environment dynamics shift) in cross-domain offline RL. The proposed DROCO algorithm centers on the Robust Cross-Domain Bellman (RCB) operator, which applies a robust Bellman update to source-domain data and a standard in-sample update to target-domain data. DROCO also reformulates intractable dynamics uncertainty as state-space perturbations via dual reconstruction. On the D4RL benchmark, it achieves a total score of 1105.2, surpassing the second-best method by 14%, and its performance degradation under hard-level dynamics perturbations is only half that of the baselines.
Dual Goal Representations: This paper proposes dual goal representations, which encode a goal state via the set of optimal temporal distances from all states to that goal. The authors theoretically prove that this representation is sufficient for recovering the optimal policy and naturally filters exogenous noise. A practical learning algorithm based on asymmetric inner product parameterization is designed, and the resulting module consistently improves three mainstream offline GCRL methods across 20 OGBench tasks as a plug-and-play component.
DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning: This paper proposes DVLA-RL, a framework that employs Dual-level Semantic Construction (DSC) to generate complementary low-level attributes and high-level descriptions, and uses RL-gated Attention (RLA) to dynamically balance the contributions of self-attention and cross-attention across different network layers. This achieves hierarchical vision-language alignment from low to high levels, attaining state-of-the-art performance on 9 few-shot learning benchmarks.
Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning: This paper proposes a novel paradigm called audio-interleaved reasoning, which treats audio as an active component during inference rather than a static context, enabling LALMs to dynamically locate and re-listen to audio segments during the reasoning process. Through a two-stage SFT+RL training framework and a structured data generation pipeline, the authors build the Echo model, which surpasses GPT-4o and Gemini-2.0-Flash on both expert-level and general audio understanding benchmarks.
Efficient Estimation of Kernel Surrogate Models for Task Attribution: This paper proposes a kernel surrogate model (KernelSM) for task attribution. By employing RBF kernel ridge regression to capture nonlinear interaction effects among tasks, combined with a gradient-projection-based efficient estimation algorithm that eliminates repeated retraining, KernelSM achieves a 25% improvement in correlation over linear surrogate and influence function baselines across mathematical reasoning, in-context learning, and multi-objective RL settings.
EGG-SR: Embedding Symbolic Equivalence into Symbolic Regression via Equality Graph: This paper proposes Egg-SR, a unified framework that embeds symbolic equivalence via equality graphs (e-graphs) into three categories of symbolic regression methods—MCTS, DRL, and LLM—achieving subtree pruning, policy gradient variance reduction, and feedback prompt enrichment, respectively. Theoretical results prove that Egg-MCTS tightens the regret bound and Egg-DRL reduces gradient estimation variance, while experiments consistently validate improved expression discovery accuracy.
Emergence of Spatial Representation in an Actor-Critic Agent with Hippocampus-Inspired Sequence Generator: Inspired by the intrinsic recurrent circuitry of hippocampal region CA3, this paper proposes a minimal sequence generator (shift register) integrated with an actor-critic framework to achieve maze navigation under sparse visual input, while giving rise to neurobiologically observed phenomena including place fields, DG orthogonalization, distance-dependent spatial kernels, and task-dependent remapping.
Entropy-Preserving Reinforcement Learning (REPO / ADAPO): This paper identifies the theoretical root cause of systematic policy entropy collapse in policy gradient RL algorithms for LLM post-training — namely, the positive correlation between advantage functions and log-probabilities — and proposes two complementary solutions: REPO (decorrelating the advantage function) and ADAPO (adaptive asymmetric clipping), achieving state-of-the-art performance on interactive tool-use tasks.
ExGRPO: Learning to Reason from Experience: This paper presents the first systematic study of what types of reasoning experiences are most valuable for RLVR, finding that medium-difficulty problems paired with low-entropy trajectories are most effective. Based on these findings, it proposes the ExGRPO framework for experience management and mixed-policy optimization, achieving an average gain of +3.5 points on mathematical reasoning and +7.6 points on general reasoning.
Exploration vs Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward: Through theoretical derivation and cross-model experiments, this paper demonstrates that the learning signal provided by clipping bias in RLVR is negligible (≤1/17); the true effect of clipping is an implicit entropy compression on the policy. A reward mislabeling model is further proposed to explain why random rewards can benefit stronger models.
FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning: To address the problem of flawed-positive rollouts in RLVR training—where the model reaches a correct answer through unreliable reasoning—this paper proposes the FAPO algorithm. FAPO employs a GenRM to detect flawed reasoning and applies a parameter-free reward penalty mechanism that realizes a natural "exploit-then-suppress" learning trajectory, simultaneously improving outcome correctness, process reliability, and training stability.
Flow Actor-Critic for Offline Reinforcement Learning (FAC): FAC is the first method to use a continuous normalizing flow model for both an expressive actor policy and a critic penalty mechanism based on exact density estimation. By identifying OOD regions for selective conservative Q-value estimation, FAC achieves an average score of 60.3 across 55 OGBench tasks, substantially outperforming the previous best of 43.6.
From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning: This work discovers that the reasoning performance of multimodal LLMs is highly correlated with the Visual Attention Score (VAS) ($r=0.96$), and proposes the AVAR framework, which improves VAS through three stages—visual-anchored data synthesis, attention-guided training objectives, and visual-anchored reward shaping—achieving an average improvement of 7% across 77 benchmarks.
From Observations to Events: Event-Aware World Model for Reinforcement Learning: This paper proposes the Event-Aware World Model (EAWM), a general framework that automatically generates events from raw observations and learns event-aware representations without manual annotations, improving existing MBRL baselines by 10%–45% and achieving new state-of-the-art results on Atari 100K, Craftax 1M, DeepMind Control 500K, and DMC-GB2 500K.
From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for RL of Open-ended Generation: This paper proposes RLVRR, a framework that extends RLVR (reinforcement learning with verifiable rewards) from mathematical/code reasoning to open-ended text generation. It extracts hierarchical keyword sequences (content rewards) and executable Python checking functions (style rewards) from high-quality reference answers, forming a "reward chain" to replace single-point verification signals. On 10+ benchmarks, RLVRR trained on 10K examples outperforms 100K SFT and advanced reward models.
GraphOmni: A Comprehensive and Extensible Benchmark Framework for Large Language Models on Graph-theoretic Tasks: This paper proposes GraphOmni, a benchmark framework that systematically evaluates the graph-theoretic reasoning capabilities of 11 LLMs across 241K queries spanning 7 graph types × 7 serialization formats × 9 prompting strategies, reveals complex interaction effects among these three dimensions, and introduces an RL-guided combinatorial search method that achieves approximately 90% of optimal accuracy at roughly 25% of the evaluation cost.
Helix: Evolutionary Reinforcement Learning for Open-Ended Scientific Problem Solving: This paper proposes HELIX, a framework that integrates reinforcement learning (GRPO) with evolutionary algorithms (NSGA-II) for open-ended scientific problem solving. RL iteratively optimizes the policy, evolutionary mechanisms balance solution quality and diversity, and in-context learning leverages historical solutions to guide exploration. Using only a 14B model, HELIX surpasses GPT-4o pipelines across 20 tasks spanning circle packing, machine learning optimization, and more.
How Far Can Unsupervised RLVR Scale LLM Training?: This paper presents a comprehensive analysis of Unsupervised Reinforcement Learning with Verifiable Rewards (URLVR), demonstrating that all intrinsic reward methods fundamentally operate as a "sharpening" mechanism over the model's initial distribution, leading to an inevitable rise-then-fall collapse pattern. It proposes the Model Collapse Step as a prior-based model indicator and identifies external reward methods as the key direction for overcoming scalability bottlenecks.
How LLMs Learn to Reason: A Complex Network Perspective: This paper proposes a "sparse concept network" theory from a complex network perspective to provide a unified explanation of four puzzling phenomena in RLVR training (V-shaped response length, two-stage learning curve, catastrophic forgetting, and policy collapse). It reveals that all four phenomena originate from the topological self-organization of sparse reasoning graphs with average degree approximately 2, and derives the Annealed-RLVR algorithm, which surpasses standard RLVR on mathematical reasoning benchmarks.
InFOM: Intention-Conditioned Flow Occupancy Models: InFOM learns a latent intention encoder via variational inference and models intention-conditioned discounted state occupancy measures using flow matching, enabling efficient pre-training and fine-tuning in RL. It achieves 1.8× median return and 36% higher success rate over baselines across 36 state-based tasks and 4 image-based tasks.
Is Pure Exploitation Sufficient in Exogenous MDPs with Linear Function Approximation?: This paper proves that pure exploitation (without exploration) suffices to achieve sublinear regret in Exogenous MDPs (Exo-MDPs), where uncertainty arises solely from exogenous inputs independent of agent actions. In the tabular setting, the PTO algorithm attains $\tilde{O}(H^2|\Xi|\sqrt{K})$ regret; under linear function approximation, the LSVI-PE algorithm achieves regret that is polynomial in the feature dimension and the exogenous state space, yet independent of the endogenous state and action spaces.
LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection: This paper proposes the LadderSym architecture for music practice error detection. It addresses insufficient cross-stream alignment in late-fusion approaches via an interleaved cross-stream alignment module (Ladder), and reduces frequency ambiguity in audio-only score representations by incorporating symbolic score prompts (Sym). On MAESTRO-E, the missed-note F1 score improves from 26.8% to 56.3%.
Latent Wasserstein Adversarial Imitation Learning: LWAIL leverages ICVF to learn a dynamics-aware latent representation from a small amount of random data, replacing the Euclidean ground metric in Wasserstein-based imitation learning with a latent-space distance. The method achieves expert-level imitation performance using only a single state-only expert trajectory.
Learning from Synthetic Data Improves Multi-hop Reasoning: This paper finds that RLVR training on fully fictitious, rule-generated synthetic data significantly improves LLM performance on real-world multi-hop reasoning tasks (56%–131% gains for Qwen3-0.6B), because the model learns knowledge composition as a generalizable reasoning skill rather than memorizing factual knowledge.
Learning to Generate Unit Test via Adversarial Reinforcement Learning: This paper proposes UTRL, a framework that iteratively trains a unit test generator and a code generator via adversarial RL — the test generator learns to produce discriminative test cases that distinguish LLM-generated code from correct solutions, while the code generator learns to pass those tests. A Qwen3-4B model trained with UTRL surpasses GPT-4.1 in test generation quality.
Learning to Orchestrate Agents in Natural Language with the Conductor: A 7B Qwen2.5 model is trained via GRPO as a "Conductor" that outputs complete agent workflows in natural language—comprising subtask instructions, worker assignments, and communication topology access lists—to coordinate frontier models such as GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro. Trained on only 960 questions × 200 iterations, the Conductor achieves an average accuracy of 77.27% across 7 reasoning benchmarks, surpassing all single-model baselines (GPT-5: 74.78%) and multi-agent baselines.
Learning to Play Multi-Follower Bayesian Stackelberg Games: This paper provides the first systematic study of online learning in multi-follower Bayesian Stackelberg Games (BSGs). By geometrically partitioning the leader's strategy space into best-response regions, it achieves a regret bound of $\tilde{O}(\sqrt{\min\{L, nK\} \cdot T})$ under type feedback — a bound that does not grow polynomially in the number of followers $n$ — and establishes a nearly matching lower bound of $\Omega(\sqrt{\min\{L, nK\}T})$.
Less is More: Clustered Cross-Covariance Control for Offline RL: This paper identifies that the standard squared-error TD objective introduces harmful cross-covariance in offline RL, and proposes C⁴ (Clustered Cross-Covariance Control for TD), which mitigates this effect via partitioned buffer sampling and an explicit gradient-based corrective penalty, achieving up to 30% return improvement in small-dataset and OOD-dominated settings.
LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards: This paper proposes LongRLVR, which introduces verifiable context rewards into RLVR training to address the gradient vanishing problem of contextual grounding caused by relying solely on final-answer rewards in long-context settings, significantly improving LLM long-context reasoning capabilities.
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning: This paper proposes LongWriter-Zero: starting from a base model, without relying on any annotated or synthetic data, the approach uses GRPO reinforcement learning combined with a three-dimensional composite reward model (length / quality / format) to elicit emergent ultra-long, high-quality text generation. With 32B parameters, the model surpasses 100B+ models such as DeepSeek-R1 and Qwen3-235B on WritingBench.
LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts: This paper proposes LoongRL, which constructs KeyChain synthetic data for reinforcement learning to elicit a plan–retrieve–reason–recheck reasoning pattern in LLMs for long-context tasks. Training solely on 16K contexts generalizes to 128K; the 14B model scores 74.2, approaching o3-mini (74.5) and DeepSeek-R1 (74.9).
MARS-Sep: Multimodal-Aligned Reinforced Sound Separation: MARS-Sep reformulates query-conditioned sound separation as a reinforcement learning problem, performing stochastic decisions over time-frequency bins via a factorized Beta mask policy, and leverages a progressively aligned multimodal encoder to provide semantic reward signals, achieving simultaneous improvements in signal fidelity and semantic consistency.
Menlo: From Preferences to Proficiency – Evaluating and Modeling Native-like Quality Across 47 Languages: This paper proposes the Menlo framework, which decomposes native-like response quality into four dimensions grounded in audience design theory, constructs a preference dataset of 6,423 annotated pairs covering 47 language varieties, and demonstrates that pairwise evaluation combined with RL-trained LLM judges can approach human annotator agreement levels.
MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding: MergeMix proposes a token merging–based mixup data augmentation method that generates mixed images in attention space via bipartite soft matching, uses the mixing ratio as a soft margin in preference optimization, and unifies SFT and RL training paradigms across image classification and multimodal large language model settings.
Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start: This paper proposes SPECS, a three-stage cold-start framework that (1) generates preference data via self-distillation (distinguishing only format differences), (2) applies DPO for format pre-alignment as the cold start, and (3) follows with GRPO fine-tuning. By decoupling format learning from reasoning learning, SPECS achieves consistent performance gains of +4.1% on MEGA-Bench and +12.2% on MathVista.
ROMI: Model-based Offline RL via Robust Value-Aware Model Learning with Implicitly Differentiable Adaptive Weighting: ROMI achieves robust value-aware model learning by converting the dynamics uncertainty set into a state uncertainty set via Wasserstein duality, and employs an implicitly differentiable adaptive weighting mechanism to balance dynamics accuracy against value-awareness. This resolves the Q-value underestimation and gradient explosion issues in RAMBO, achieving state-of-the-art performance among model-based offline RL methods on D4RL and NeoRL.
Model Predictive Adversarial Imitation Learning for Planning from Observation: This paper proposes MPAIL (Model Predictive Adversarial Imitation Learning), which embeds an MPPI planner natively into the adversarial imitation learning loop, achieving the first end-to-end Planning-from-Observation (PfO) framework. MPAIL comprehensively outperforms policy-based AIL methods in generalization, robustness, interpretability, and sample efficiency, and is successfully deployed on a real-world robot navigation task from a single observed demonstration.
MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation: MoMaGen formulates demonstration data generation for bimanual mobile manipulation as a constrained optimization problem. By combining hard constraints (reachability, collision-free motion, visibility) with soft constraints (object visibility during navigation, retraction to compact poses), the framework automatically generates large-scale, diverse datasets from a single human teleoperation demonstration. The resulting visuomotor policy can be deployed on a physical robot with only 40 real demonstrations for fine-tuning.
MVR: Multi-view Video Reward Shaping for Reinforcement Learning: This paper proposes the MVR framework, which learns a state relevance function from multi-view video via video-text similarity matching. Combined with state-dependent reward shaping that automatically attenuates VLM guidance, MVR outperforms existing VLM-based reward methods across 19 tasks on HumanoidBench and MetaWorld.
Near-Optimal Second-Order Guarantees for Model-Based Adversarial Imitation Learning: This paper proposes MB-AIL (Model-Based Adversarial Imitation Learning), establishing horizon-free second-order sample complexity upper bounds under general function approximation. Combined with information-theoretic lower bounds on constructed hard instances, MB-AIL is shown to be minimax optimal (up to logarithmic factors) in terms of online interaction sample complexity.
Nearly-Optimal Bandit Learning in Stackelberg Games with Side Information: By linearizing the leader's utility space in Stackelberg games, this paper proposes a reduction to linear contextual bandits that improves the regret bound from $\tilde{O}(T^{2/3})$ to the nearly-optimal $\tilde{O}(T^{1/2})$ under bandit feedback with side information.
Offline Reinforcement Learning with Generative Trajectory Policies: This paper proposes the Generative Trajectory Policy (GTP), which unifies diffusion models, flow matching, and consistency models by learning the complete solution mapping of an ODE. Combined with two key adaptation techniques—score approximation and value-guided weighting—GTP achieves state-of-the-art performance on D4RL.
On Discovering Algorithms for Adversarial Imitation Learning: This paper proposes DAIL — the first meta-learned adversarial imitation learning algorithm. It decomposes AIL into two stages (density ratio estimation and reward assignment), and employs LLM-guided evolutionary search to automatically discover an optimal reward assignment (RA) function $r_{\text{disc}}$, achieving generalization to unseen environments and policy optimizers while surpassing all manually designed baselines.
On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification: This paper mathematically proves, from an RL policy gradient perspective, that SFT gradients implicitly encode a pathological reward structure with inverse probability weighting ($1/\pi_\theta$), causing excessively large gradients on low-probability tokens and limiting generalization. The paper proposes DFT (Dynamic Fine-Tuning), which requires only a one-line code modification (multiplying the CE loss by the token probability: $-p\log p$) to eliminate inverse probability weighting. DFT substantially outperforms SFT on mathematical reasoning, code generation, and multimodal tasks, and even surpasses GRPO/PPO in the offline RL setting.
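The reweighting in the DFT entry above can be illustrated on a single target token with probability `p`. This is a minimal sketch, not the authors' code: the function names are hypothetical, and it assumes the probability factor is held constant (a stop-gradient) when differentiating, which the summary does not state explicitly.

```python
import math

def sft_token_loss(p: float) -> float:
    """Standard cross-entropy on the target token: -log p."""
    return -math.log(p)

def dft_token_loss(p: float) -> float:
    """Cross-entropy scaled by the token probability: -p * log p."""
    return p * sft_token_loss(p)

def sft_grad_wrt_p(p: float) -> float:
    """d(-log p)/dp = -1/p, which grows without bound as p -> 0."""
    return -1.0 / p

def dft_grad_wrt_p(p: float) -> float:
    """Gradient of -c * log p w.r.t. p with the weight c = p held constant
    (assumed stop-gradient): -c / p = -1, independent of p."""
    c = p  # treated as a constant weight, not differentiated
    return -c / p

for p in (0.9, 0.1, 0.001):
    print(f"p={p:<6} SFT grad={sft_grad_wrt_p(p):9.1f}  DFT grad={dft_grad_wrt_p(p):5.1f}")
```

Under this assumption the $1/\pi_\theta$ factor cancels, so a low-probability token no longer dominates the update.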
On the $O(1/T)$ Convergence of Alternating Gradient Descent-Ascent in Bilinear Games: This paper provides the first proof that alternating gradient descent-ascent (AltGDA) converges to a Nash equilibrium at an $O(1/T)$ rate in constrained bilinear zero-sum games (when an interior NE exists), outperforming simultaneous GDA's $O(1/\sqrt{T})$ rate. The analysis characterizes the "friction" effect produced when trajectories collide with the boundary via an energy function decay argument, and further optimizes step sizes through performance estimation programming (PEP).
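The difference between simultaneous and alternating updates can be seen on a toy problem. The sketch below uses the unconstrained scalar game $f(x, y) = xy$ (x minimizes, y maximizes), which is an assumption made for brevity: the paper's setting is constrained, and the step size, starting point, and function names here are illustrative only.

```python
import math

def simultaneous_gda(x, y, eta, steps):
    """Both players update from the same (old) iterate."""
    for _ in range(steps):
        x, y = x - eta * y, y + eta * x
    return x, y

def alternating_gda(x, y, eta, steps):
    """x moves first; y then responds to the already-updated x."""
    x_sum = 0.0
    for _ in range(steps):
        x = x - eta * y
        y = y + eta * x  # uses the new x
        x_sum += x
    return x, y, x_sum / steps  # last iterate plus time-averaged x

sim = simultaneous_gda(1.0, 1.0, 0.1, 1000)
alt_x, alt_y, alt_avg_x = alternating_gda(1.0, 1.0, 0.1, 1000)
print("simultaneous norm:", math.hypot(*sim))
print("alternating norm: ", math.hypot(alt_x, alt_y))
print("alternating averaged x:", alt_avg_x)
```

In this toy game the simultaneous iterates spiral outward, while the alternating iterates stay bounded and their time average shrinks toward the equilibrium at the origin.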
One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning: This paper proposes ScaleZero, which incorporates a Mixture-of-Experts (MoE) architecture into a unified world model to address gradient conflict and plasticity collapse in multi-task learning. Combined with a Dynamic Parameter Scaling (DPS) strategy that adaptively allocates model capacity, a single multi-task model achieves performance comparable to single-task expert models across three benchmarks (Atari/DMC/Jericho) while reducing environment interactions by approximately 28.5%.
Online Minimization of Polarization and Disagreement via Low-Rank Matrix Bandits: This paper is the first to formalize the problem of minimizing polarization and disagreement under the Friedkin-Johnsen opinion dynamics model as an online low-rank matrix bandit problem (OPD-Min). A two-phase algorithm, OPD-Min-ESTR, is proposed that reduces the dimensionality from $|V|^2$ to $O(|V|)$ via subspace estimation, achieving substantial improvements over full-dimensional linear bandit baselines on both synthetic and real-world networks.
Online Prediction of Stochastic Sequences with High Probability Regret Bounds: This paper revisits the classical problem of universal prediction of stochastic sequences over a finite horizon $T$, and establishes, for the first time, vanishing regret bounds that hold with high probability in the form $O(T^{-1/2}\delta^{-1/2})$. These bounds closely mirror the existing expected regret bound of $O(T^{-1/2})$, and the paper further proves that the exponent of $\delta$ cannot be improved without additional assumptions.
Optimistic Task Inference for Behavior Foundation Models: This paper proposes OpTI-BFM — a test-time task inference method for Behavior Foundation Models that requires neither a complete reward function nor an annotated dataset, and recovers oracle performance within approximately 5 episodes of environment interaction. The core insight is to exploit the linear structure of successor features to reduce task inference to a linear bandit problem, employing a UCB strategy for optimistic exploration in task embedding space, with formal regret guarantees.
P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling: This paper proposes P-GenRM, the first personalized generative reward model. Through a three-stage training pipeline—PSI supervised fine-tuning to construct structured evaluation chains, CRE reinforcement learning to enhance reasoning under missing preference signals, and hard-negative curriculum learning to improve robustness—P-GenRM converts mixed preference signals into context-adaptive user personas and scoring rubrics. At inference time, a dual-granularity test-time scaling strategy is introduced: individual-level multi-sample aggregation and prototype-level collaborative filtering that borrows preferences from similar users. P-GenRM surpasses the previous SOTA by 2.31% on PersonalRewardBench, with test-time scaling yielding an additional ~3% gain, while generalizing to unseen users.
ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-Aware Speech-to-Speech Interaction: This paper proposes the ParaS2S framework, which comprises ParaS2SBench — a benchmark for evaluating paralinguistic-aware (emotion/sarcasm/age/gender) speech-to-speech interaction — and ParaS2SAlign, a GRPO-based RL alignment framework that enables S2S models to learn style-adaptive response generation with minimal labeled data.
Partially Equivariant Reinforcement Learning in Symmetry-Breaking Environments: This paper proposes the Partially Invariant MDP (PI-MDP) framework, which employs a learnable gating function $\lambda(s,a)$ to pointwise switch between equivariant and standard Bellman updates across the state-action space. The paper theoretically proves that local symmetry breaking propagates through discounted backup and amplifies global value function error by a factor of $1/(1-\gamma)$, while PI-MDP provably confines the error strictly within the breaking region. The framework is instantiated as PE-DQN and PE-SAC, achieving comprehensive improvements over strictly equivariant and approximately equivariant baselines on Grid-World, MuJoCo locomotion, and robotic manipulation tasks.
PolicyFlow: Policy Optimization with Continuous Normalizing Flow in Reinforcement Learning: PolicyFlow seamlessly integrates a continuous normalizing flow (CNF) policy into the PPO framework: it approximates the importance ratio via velocity field differences along an interpolated path (avoiding full ODE path backpropagation), and introduces a Brownian motion-inspired implicit entropy regularizer to prevent mode collapse. The method matches or surpasses Gaussian PPO and flow-based baselines (FPO/DPPO) across MultiGoal, PointMaze, IsaacLab, and MuJoCo environments.
Post-training Large Language Models for Diverse High-Quality Responses: This paper proposes DQO (Diversity Quality Optimization), which defines a diversity metric in semantic embedding space via determinantal point processes (DPP), and jointly optimizes it with reward signals to simultaneously improve semantic diversity and response quality during LLM post-training. DQO can be stacked on top of GRPO/PPO.
PreferThinker: Reasoning-based Personalized Image Preference Assessment: This paper proposes PreferThinker, which introduces a universal visual preference profile to bridge across different users and adopts a predict-then-assess CoT reasoning paradigm for interpretable personalized image preference assessment. Combined with cold-start SFT and GRPO reinforcement learning along with a similarity-aware prediction reward, the 7B model outperforms GPT-4o (+5.2%) and Claude 3.7 (+5.1%).
Principled Fast and Meta Knowledge Learners for Continual Reinforcement Learning: Inspired by the hippocampus–neocortex interaction in the human brain, this paper proposes FAME, a dual-learner framework for continual reinforcement learning that employs a fast learner for knowledge transfer and a meta learner for knowledge consolidation, achieving efficient continual RL while minimizing catastrophic forgetting in a principled way.
Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models: This paper models LLM layer pruning as a cooperative game, employing a lightweight surrogate network to approximate Shapley values that capture inter-layer dependencies, achieving superior deep pruning performance over static heuristic methods.
QuRL: Efficient Reinforcement Learning with Quantized Rollout: This paper proposes QuRL, a method that quantizes the actor model to accelerate the rollout phase in RL training. It introduces Adaptive Clipping Range (ACR) to address training collapse caused by quantization, and Update-Aware Quantization (UAQ) to resolve the scale mismatch between weight updates and quantization error. QuRL achieves 20%–80% inference throughput improvement without performance degradation.
REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning: REA-RL is a framework that uses a small distilled reflection model to detect and truncate overthinking tokens online, generating revised reasoning paths, while a reflection reward prevents the model from degenerating into non-reflective vanilla CoT during RL training. On DeepSeek-R1-Distill-Qwen-7B, it achieves a 36% reduction in reasoning token consumption with zero accuracy loss.
Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment: Through systematic experimentation, this paper reveals the fundamental mechanism underlying the generalization capability of RL-trained reasoning-based IQA models — the reasoning process essentially transforms redundant visual representations into compact, cross-domain aligned textual representations. Building on this insight, the paper proposes the RALI algorithm, which directly aligns image representations to these textual representations via contrastive learning, achieving comparable generalization performance with less than 5% of the parameters and inference time.
Reasoning Boosts Opinion Alignment in LLMs: This paper applies GRPO-based reinforcement learning to train LLMs to align with individual political opinions via structured reasoning. SFT+GRPO consistently outperforms ICL and ORPO baselines across U.S., German, and Swiss datasets; the study also reveals a systematic left–right ideological asymmetry and a fundamental difficulty in predicting Neutral stances.
RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind: This paper is the first to introduce Theory of Mind (ToM) into academic rebuttal, proposing a three-stage ToM-Strategy-Response (TSR) framework: first modeling the reviewer's mental state, then formulating a persuasion strategy, and finally generating evidence-grounded responses. Combined with self-reward RL training and a dedicated Rebuttal-RM evaluator, the approach achieves an average improvement of 18.3% over the base model.
References Improve LLM Alignment in Non-Verifiable Domains: This paper proposes RefEval, a reference-guided LLM-as-Judge framework that uses high-quality reference outputs as "soft verifiers," improving LLM-judge accuracy by 6.8%. Building on this, the authors design a two-stage self-improvement pipeline (SFT distillation + reference-guided DPO) that outperforms SFT distillation alone by +19.2/+16.5 on AlpacaEval/Arena-Hard, matching the performance of the fine-tuned reward model ArmoRM — demonstrating that effective LLM alignment in non-verifiable domains is achievable without human preference annotation.
ReFORM: Reflected Flows for On-support Offline RL via Noise Manipulation: ReFORM manipulates the source distribution of a behavior cloning (BC) flow policy by learning a reflected flow noise generator. This enforces support constraints constructively, avoiding OOD issues while preserving policy expressiveness, and requires no hyperparameter tuning.
Regret-Guided Search Control for Efficient Learning in AlphaZero: This paper proposes RGSC (Regret-Guided Search Control), a framework that trains a regret network to identify high-regret states and prioritizes restarting self-play from these states, emulating the human learning strategy of repeatedly reviewing mistakes. RGSC outperforms AlphaZero by an average of 77 Elo across 9×9 Go, 10×10 Othello, and 11×11 Hex.
Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models: This paper formulates layer pruning in LLMs as a cooperative game in which each layer is a player and model performance is the utility. Because exact Shapley value computation is infeasible ($2^L$ layer combinations), it proposes a two-stage approximation: (1) stratified Monte Carlo sampling generates layer masks, whose perplexity (PPL) is evaluated as a supervision signal; (2) a lightweight surrogate network is trained to predict the performance of arbitrary masks. The surrogate enables efficient per-layer Shapley value estimation that captures inter-layer dependencies, and the method substantially outperforms static heuristic pruning baselines.
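The per-layer Shapley value in this entry is the average marginal gain from adding a layer, taken over orderings of the layers. The sketch below is a simplified stand-in, not the paper's pipeline: it uses plain permutation sampling rather than stratified sampling, has no surrogate network, and `toy_utility` with its `WEIGHTS` is a hypothetical replacement for a real perplexity-based score.

```python
import itertools
import random

def shapley_from_permutations(num_layers, utility, permutations):
    """Average each layer's marginal contribution over the given orderings."""
    totals = [0.0] * num_layers
    count = 0
    for perm in permutations:
        kept = set()
        prev = utility(frozenset(kept))
        for layer in perm:
            kept.add(layer)
            cur = utility(frozenset(kept))
            totals[layer] += cur - prev  # gain from adding this layer
            prev = cur
        count += 1
    return [t / count for t in totals]

def sampled_permutations(num_layers, num_samples, seed=0):
    """Yield uniformly random layer orderings (Monte Carlo approximation)."""
    rng = random.Random(seed)
    layers = list(range(num_layers))
    for _ in range(num_samples):
        yield rng.sample(layers, num_layers)

# Hypothetical utility: additive per-layer value plus a bonus when layers 0
# and 1 are both kept, to mimic an inter-layer dependency.
WEIGHTS = [1.0, 2.0, 0.5, 0.25]

def toy_utility(kept):
    score = sum(WEIGHTS[i] for i in kept)
    if 0 in kept and 1 in kept:
        score += 1.0
    return score

exact = shapley_from_permutations(4, toy_utility, itertools.permutations(range(4)))
approx = shapley_from_permutations(4, toy_utility, sampled_permutations(4, 2000))
print("exact: ", exact)
print("approx:", approx)
```

Enumerating all orderings is only feasible for a handful of layers, which is why sampling (and, in the paper, a learned surrogate for the utility) is needed at LLM depth.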
ReMix: Reinforcement Routing for Mixtures of LoRAs in LLM Finetuning: ReMix identifies a severe routing weight collapse problem in existing Mixture-of-LoRAs models (even when $k>1$ LoRAs are activated, the effective LoRA count rapidly degenerates to 1), proposes non-learnable constant routing weights to ensure equal contribution from all activated LoRAs, and trains the router using the RLOO reinforcement learning gradient estimator, significantly outperforming state-of-the-art PEFT methods.
ReMoT: Reinforcement Learning with Motion Contrast Triplets: ReMoT proposes a unified training paradigm that systematically enhances VLM spatiotemporal consistency reasoning through a rule-driven motion contrast triplet dataset (ReMoT-16K) and Group Relative Policy Optimization (GRPO) with composite reward optimization, achieving a 25.1% performance gain on spatiotemporal reasoning benchmarks.
Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent RL: This paper proposes S2Q (Successive Sub-value Q-learning), which explicitly retains suboptimal joint actions by successively learning $K$ sub-value functions. Combined with a Softmax behavior policy for prioritized sampling among candidates, S2Q addresses the root cause of suboptimal convergence in cooperative MARL value decomposition methods—namely, that policy optima shift dynamically during training.
Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning: This paper theoretically analyzes how inter-policy diversity affects learning efficiency in ensemble policy gradient methods, and proposes Coupled Policy Optimization (CPO), which regulates diversity via KL divergence constraints to achieve efficient and stable exploration in large-scale parallel environments.
Revisiting Matrix Sketching in Linear Bandits: Achieving Sublinear Regret via Dyadic Block Sketching: This paper identifies a fundamental flaw in existing sketch-based linear bandit methods: when the spectrum of the streaming matrix has a heavy tail, these methods degenerate to linear regret. To address this, the paper proposes the Dyadic Block Sketching (DBS) framework, which dynamically doubles the sketch size to control the global approximation error within a user-specified parameter $\epsilon$. The resulting algorithm guarantees sublinear regret without requiring prior knowledge of the spectral structure of the streaming matrix, and adaptively recovers the computational efficiency of single-scale methods when the spectrum is favorable.
RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning: This paper proposes the RewardMap framework, which addresses the sparse reward problem in fine-grained visual reasoning through difficulty-aware detail reward design and a multi-stage RL curriculum that progresses from simple perception to complex reasoning.
RLP: Reinforcement as a Pretraining Objective: This paper proposes RLP (Reinforcement Learning Pretraining), an information-gain-driven RL pretraining objective that rewards Chain-of-Thought (CoT) reasoning when it improves next-token prediction probability. RLP shifts reinforcement learning from the post-training stage into pretraining, enabling dense reward signals without any verifier.
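One way to read the information-gain reward in the RLP entry is as a log-likelihood difference for the observed next token, with and without a sampled chain of thought. The sketch below is an assumption about the form of the reward, not the paper's exact objective; the function name and the example probabilities are hypothetical.

```python
import math

def information_gain_reward(p_with_cot: float, p_without_cot: float) -> float:
    """Log-likelihood gain on the true next token from conditioning on a CoT.

    Positive when the reasoning makes the observed token more likely,
    negative when it hurts, zero when it changes nothing.
    """
    return math.log(p_with_cot) - math.log(p_without_cot)

helpful = information_gain_reward(0.40, 0.10)   # reasoning raised the probability
harmful = information_gain_reward(0.05, 0.10)   # reasoning lowered it
print(helpful, harmful)
```

Because the signal comes from the model's own next-token probabilities, every position in ordinary text can yield a reward, which is consistent with the entry's claim of dense rewards without a verifier.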
RM-R1: Reward Modeling as Reasoning: This paper reframes reward modeling as a reasoning task, introducing the RM-R1 family of Reasoning Reward Models (ReasRM). Through reasoning distillation combined with RL training and a Chain-of-Rubrics (CoR) mechanism, RM-R1 outperforms 70B and GPT-4o models by an average of 4.9% across three major reward model benchmarks.
Robust Deep Reinforcement Learning against Adversarial Behavior Manipulation: This paper studies a novel threat in RL—behavior-targeted attacks (where an adversary manipulates observations to steer the victim toward executing a specific target policy)—and proposes BIA, a black-box attack method, along with TDRT, a temporally discounted robust training defense. TDRT achieves robustness against such attacks while outperforming the existing defense SA-PPO on original task performance by 28.2%.
Robust Multi-Objective Controlled Decoding of Large Language Models: This paper proposes RMOD (Robust Multi-Objective Decoding), an inference-time algorithm that dynamically computes worst-case objective weights by solving for the Nash equilibrium of a minimax game, achieving robust multi-objective alignment of LLMs without requiring any prior knowledge of objective weights.
Routing, Cascades, and User Choice for LLMs: This paper models LLM routing as a provider-user Stackelberg game, proves that the optimal routing policy is almost always a static, cascade-free threshold rule, reveals user-provider misalignment when quality/cost rankings are inconsistent, and shows that under low churn penalties providers are incentivized to inflate latency via throttling to reduce cost at the expense of user utility.
RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling: RuleReasoner constructs a diverse rule reasoning dataset, RuleCollection-32K, and proposes a domain-aware dynamic sampling strategy (Dads). Under the RLVR framework, an 8B model trained with this approach outperforms OpenAI-o1 by 4.1% on in-distribution reasoning tasks and by 10.4% on out-of-distribution tasks, while achieving approximately 1.4× training efficiency improvement.
Safe Continuous-time Multi-Agent Reinforcement Learning via Epigraph Form: This paper proposes the first continuous-time multi-agent RL framework that explicitly handles state constraints. By reformulating the discontinuous constrained value function into a continuous representation via the epigraph form, and combining an improved PINN-based actor-critic method, the framework achieves safe and stable continuous-time multi-agent control.
Sample-efficient and Scalable Exploration in Continuous-Time RL: This paper proposes COMBRL, an algorithm that achieves scalable and sample-efficient exploration in continuous-time model-based RL by maximizing a weighted sum of extrinsic reward and epistemic uncertainty, with theoretical guarantees of sublinear regret.
Scalable Exploration for High-Dimensional Continuous Control via Value-Guided Flow: This paper proposes Qflex (Q-guided Flow Exploration), a scalable RL method for exploration in high-dimensional continuous action spaces. It transports actions from a learnable source distribution along a probability flow induced by the Q-function, aligning exploration with task-relevant gradients rather than isotropic noise. Qflex outperforms Gaussian and diffusion-based RL baselines across various high-dimensional benchmarks, and successfully controls a full-body musculoskeletal model with 700 actuators to perform agile and complex motions.
Scalable In-Context Q-Learning: This paper proposes S-ICQL, which introduces dynamic programming (Q-learning) and world models into the supervised ICRL framework. A multi-head Transformer simultaneously predicts the policy and in-context value functions, a pretrained world model constructs lightweight yet accurate prompts, and advantage-weighted regression is used for policy extraction. S-ICQL consistently outperforms all baselines when learning from suboptimal data in both discrete and continuous environments.
Self-Harmony: Learning to Harmonize Self-Supervision and Self-Play in Test-Time Reinforcement Learning: This paper proposes the Self-Harmony framework, in which a single model plays two roles—a Solver that addresses the original problem and a Reframer that paraphrases it—and uses the harmonic mean of answer scores across both perspectives as a pseudo-label selection criterion, replacing conventional majority voting. The approach achieves state-of-the-art performance in 28 out of 30 experimental settings with zero training failures.
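The selection rule in the Self-Harmony entry can be sketched with answer frequencies standing in for "answer scores". That choice is an assumption, as are the function names and the toy samples; the paper may define the scores differently.

```python
from collections import Counter

def frequencies(answers):
    """Fraction of samples that produced each answer."""
    counts = Counter(answers)
    total = len(answers)
    return {a: c / total for a, c in counts.items()}

def harmonic_select(original_answers, reframed_answers):
    """Pick the answer with the highest harmonic mean of its two frequencies."""
    f_orig = frequencies(original_answers)
    f_ref = frequencies(reframed_answers)
    scores = {}
    for a in set(f_orig) | set(f_ref):
        p, q = f_orig.get(a, 0.0), f_ref.get(a, 0.0)
        # An answer missing from either view scores zero.
        scores[a] = 2 * p * q / (p + q) if p > 0 and q > 0 else 0.0
    best = max(sorted(scores), key=scores.get)  # sorted() makes ties deterministic
    return best, scores

original = ["A", "A", "A", "B", "B"]   # Solver samples on the original problem
reframed = ["B", "B", "B", "B", "A"]   # Solver samples on the Reframer's paraphrase
label, scores = harmonic_select(original, reframed)
majority = Counter(original).most_common(1)[0][0]
print("majority vote:", majority, "| harmonic-mean choice:", label)
```

The harmonic mean is dominated by the smaller of the two frequencies, so an answer that is popular under only one phrasing is penalized, whereas majority voting on the original samples alone would still select it.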
Self-Improving Skill Learning for Robust Skill-based Meta-Reinforcement Learning: This paper proposes SISL (Self-Improving Skill Learning), which decouples the high-level exploitation policy from a dedicated skill improvement policy, and incorporates a maximum return relabeling mechanism for skill prioritization. SISL achieves robust skill learning under noisy offline demonstration data and substantially improves the performance of skill-based meta-reinforcement learning on long-horizon tasks.
Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning: This paper proposes Shop-R1, a framework that leverages hierarchical reward design and difficulty-aware reward scaling within a reinforcement learning paradigm to substantially improve LLMs' ability to simulate realistic human online shopping behavior, achieving over 65% improvement in exact action match compared to SFT baselines.
Single Index Bandits: Generalized Linear Contextual Bandits with Unknown Reward Functions: This paper introduces the Single Index Bandit (SIB) problem — extending generalized linear bandits to the setting where the reward function is unknown — and proposes a family of efficient algorithms (STOR/ESTOR/GSTOR) based on Stein's method, achieving near-optimal regret $\tilde{O}(\sqrt{T})$ under monotone increasing reward functions.
Solving Football by Exploiting Equilibrium Structure of 2p0s Differential Games with One-Sided Information: This paper proves the atomic structure of Nash equilibrium strategies in two-player zero-sum differential games with one-sided information: the equilibrium strategy of the informed player P1 concentrates on at most $I$ action prototypes (where $I$ = number of game types), reducing game tree complexity from $U^{2K}$ to $I^K$. This enables an M1 MacBook to solve 11v11 American football with continuous action spaces (traditional complexity $10^{440}$) in 30 minutes.
Solving Parameter-Robust Avoid Problems with Unknown Feasibility using Reinforcement Learning: This paper proposes Feasibility-Guided Exploration (FGE), which simultaneously identifies the feasible parameter subset and learns a safe policy over that subset, addressing parameter-robust avoid problems with unknown feasibility. On MuJoCo tasks, FGE covers over 50% more safe parameters than the best existing methods.
Spectral Bellman Method: Unifying Representation and Exploration in RL: This paper proposes the Spectral Bellman Method (SBM), which derives a spectral relationship between the Bellman operator and feature covariance structure from the zero Intrinsic Bellman Error (IBE) condition, leading to a novel representation learning objective that naturally unifies representation learning and Thompson Sampling–based exploration.
SPELL: Self-Play Reinforcement Learning for Evolving Long-Context Language Models: This paper proposes the SPELL framework, in which a single LLM simultaneously assumes three roles—question generator, responder, and verifier—engaging in self-play reinforcement learning without human annotation to continuously improve long-context reasoning, achieving consistent performance gains across 6 long-context benchmarks.
SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning: This paper proposes SPIRAL, a framework that trains LLMs via self-play in multi-turn zero-sum games. Through Role-conditioned Advantage Estimation (RAE) to stabilize training, SPIRAL improves reasoning performance by up to 10% without domain-specific data, and reveals that different games cultivate complementary cognitive abilities.
Spotlight on Token Perception for Multimodal Reinforcement Learning: This paper proposes VPPO (Visually-Perceptive Policy Optimization), which quantifies the visual dependency of each token and refines learning signals at both the trajectory level and the token level, significantly enhancing the multimodal reasoning capabilities of large vision-language models.
Stackelberg Coupling of Online Representation Learning and Reinforcement Learning: This paper proposes SCORER, a framework that models representation learning and value function learning in Deep Q-Learning as a Stackelberg game. Through two-timescale updates—where the Q-network acts as the slow-updating leader and the encoder as the fast-updating follower—SCORER achieves stable co-adaptation without modifying the network architecture.
Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty: This paper proposes ARLCP (Adaptive Reflection and Length Coordinated Penalty), an adaptive reinforcement learning method that dynamically adjusts the weights of reflection and length penalties according to problem complexity, substantially reducing token consumption while maintaining or improving accuracy.
Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning: This paper proposes SSE (Strict Subgoal Execution), a framework that strictly distinguishes between successful and failed subgoal reaching via Frontier Experience Replay (FER), combined with a decoupled exploration policy and failure-aware path optimization. By enforcing subgoal completion within each high-level step, SSE substantially reduces the number of high-level decisions and improves success rates on long-horizon tasks.
SUSD: Structured Unsupervised Skill Discovery through State Factorization: This paper proposes SUSD (Structured Unsupervised Skill Discovery), which factorizes the state space into independent factors and assigns dedicated skill variables to each factor. Combined with a curiosity-driven factor-weighting mechanism, SUSD discovers diverse skills that cover all controllable factors in complex multi-object and multi-agent environments.
$\textbf{Re}^{2}$: Unlocking LLM Reasoning via Reinforcement Learning with Re-solving: This paper proposes Re², a pure reinforcement learning method that trains LLMs to actively abandon unproductive reasoning chains and restart the solving process during inference. The approach amplifies the rare redo behavior from ~0.5% to over 30%, achieving significant improvements over standard RLVR methods under the same training compute budget.
The Sample Complexity of Online Reinforcement Learning: A Multi-Model Perspective: This paper proposes an online reinforcement learning algorithm for nonlinear dynamical systems with continuous state-action spaces. By combining multi-model posterior sampling with certainty-equivalence control, the algorithm enables online learning of unknown systems and provides non-asymptotic policy regret guarantees that scale from finite model sets to parametric model families.
Thermodynamics of Reinforcement Learning Curricula: This paper formalizes curriculum learning in RL as a geodesic optimization problem over task space using a framework of excess work minimization drawn from non-equilibrium thermodynamics, and derives the MEW temperature annealing algorithm based on a friction tensor, outperforming standard SAC temperature scheduling on the MuJoCo Humanoid task.
Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization: This paper proposes Latent Thought Policy Optimization (LTPO), a test-time reasoning enhancement framework that requires no model parameter updates. By treating intermediate latent "thought" vectors as dynamically optimizable variables, LTPO leverages online policy gradient methods and intrinsic confidence reward signals to enhance the reasoning capability of frozen LLMs.
Toward a Dynamic Stackelberg Game-Theoretic Framework for Agent-Based Conversational AI Defense Against LLM Jailbreaking: This paper formalizes LLM jailbreaking attack-defense interactions as a dynamic Stackelberg extensive-form game, integrates Rapidly-exploring Random Tree (RRT) search over the prompt space, and proposes the Purple Agent defense architecture that achieves proactive defense through "red-team thinking, blue-team action."
Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control: LIFT proposes a three-stage pretraining-finetuning framework: (i) large-scale parallel SAC pretraining for zero-shot deployment; (ii) offline pretraining of a physics-prior world model based on Lagrangian dynamics; (iii) efficient finetuning via deterministic action execution in the environment combined with stochastic exploration within the world model. The full sim-to-real pipeline is validated on Booster T1 and Unitree G1 humanoid robots.
Towards Strategic Persuasion with Language Models: Grounded in the Bayesian Persuasion framework, this paper proposes a systematic methodology for evaluating and training the strategic persuasion capabilities of LLMs. It finds that frontier models already exhibit significant strategic persuasion ability, and that even small LLMs can substantially improve their persuasive performance through reinforcement learning.
TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models: TPRU constructs a large-scale multi-image temporal understanding dataset (24,750 QA pairs, 126,000 images) spanning 3 complementary task types (temporal ordering, next-frame prediction, previous-frame review) across 4 embodied scenarios including robotic manipulation and GUI navigation, and demonstrates that RL fine-tuning enables a 7B model to surpass GPT-4o on temporal understanding benchmarks.
TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design: TRACED improves regret approximation in Unsupervised Environment Design (UED) by augmenting the conventional positive value loss (PVL) with an Approximate Transition Prediction Loss (ATPL) to capture dynamics model mismatch, and introduces a Co-Learnability measure to quantify inter-task transfer benefits. On MiniGrid and BipedalWalker, TRACED needs only 10k updates to surpass the performance that all baselines reach at 20k updates.
Transitive RL: Value Learning via Divide and Conquer: This paper proposes Transitive Reinforcement Learning (TRL), a novel value function learning algorithm based on the divide-and-conquer paradigm. By exploiting the triangle inequality structure inherent in goal-conditioned RL, TRL recursively decomposes value function updates into subproblems, achieving superior performance over TD learning and Monte Carlo methods on long-horizon tasks.
Trinity: An Evolved LLM Coordinator: Trinity introduces a lightweight coordinator (0.6B SLM + ~10K trainable parameters in a linear head) optimized via sep-CMA-ES. In multi-turn dialogues, the coordinator routes queries to different LLMs and assigns one of three roles (Thinker, Worker, or Verifier), achieving a state-of-the-art 86.2% pass@1 on LiveCodeBench and consistently outperforming all single-model and multi-agent baselines across 4 in-distribution and 4 out-of-distribution tasks.
TROLL: Trust Regions improve Reinforcement Learning for Large Language Models: This paper proposes TROLL (Trust Region Optimization for Large Language models), which replaces the clipping mechanism in PPO with a differentiable discrete trust-region projection, enabling token-level policy updates under principled KL constraints. TROLL consistently outperforms PPO-clip on mathematical reasoning and code generation tasks.
UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings: This paper proposes UME-R1, the first framework to explore a reasoning-driven generative multimodal embedding paradigm. Through a two-stage training pipeline (cold-start SFT followed by reinforcement learning), the embedding model learns to reason before generating representations, achieving significant improvements over traditional discriminative embedding models across 78 tasks on the MMEB-V2 benchmark.
Understanding and Improving Hyperbolic Deep Reinforcement Learning: Through closed-form gradient analysis, this paper identifies the root causes of instability in hyperbolic deep RL—namely, conformal factor explosion in the Poincaré Ball and PPO trust-region breakdown induced by large-norm embeddings. It proposes Hyper++, a four-component solution comprising RMSNorm, learnable scaling, HL-Gauss categorical value loss, and the Hyperboloid model, achieving comprehensive improvements over prior baselines on ProcGen (16 environments) and Atari-5.
unsupervised learning of efficient exploration pre-training adaptive policies vi: This paper proposes ULEE, an unsupervised meta-learning method that trains adaptive policies via adversarially self-generated goal curricula, achieving efficient exploration and few-shot adaptation on the XLand-MiniGrid benchmark.
Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning: This paper constructs HitEmotion, a hierarchical multimodal emotion understanding benchmark grounded in Theory of Mind (ToM), and proposes the TMPO framework, which leverages intermediate mental states as process-level supervision to enhance the emotion reasoning capabilities of MLLMs.
Value Flows: Value Flows is the first work to introduce flow matching into distributional RL — it learns a vector field such that the induced probability density path automatically satisfies the distributional Bellman equation. Variance of the return distribution is efficiently estimated via a flow derivative ODE, enabling confidence-weighted prioritized learning. The method achieves an average 1.3× improvement in success rate across 62 OGBench tasks, and estimates return distributions 3× more accurately than C51/CODAC.
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models: This work introduces VerifyBench and VerifyBench-Hard, two evaluation benchmarks targeting reference-based reward systems widely used in training Large Reasoning Models (LRMs). Through rigorous human annotation, the benchmarks assess the accuracy of various verification systems and reveal that even the strongest models achieve only approximately 88% accuracy on hard samples, exposing substantial room for improvement in current verification systems.
Virne: A Comprehensive Benchmark for RL-based Network Resource Allocation in NFV: This paper proposes Virne — a comprehensive benchmark framework for Network Function Virtualization Resource Allocation (NFV-RA) — integrating 30+ algorithms and a gym-style environment to support systematic evaluation across cloud, edge, 5G, and other scenarios.
Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity: This paper proposes the DMVR framework and the α-DPG algorithm. By explicitly defining a target distribution that "filters out incorrect answers" and approximating it via the α-divergence family, the work unifies RLVR (Reverse KL) and rejection sampling fine-tuning (Forward KL), achieving Pareto-optimal performance on the accuracy–coverage frontier for Lean theorem proving.
When Sensors Fail: Temporal Sequence Models for Robust PPO under Sensor Drift: This paper investigates the robustness of PPO under temporally persistent sensor failures, proposes integrating sequence models (Transformer and SSMs) into PPO, derives high-probability upper bounds on infinite-horizon reward degradation under stochastic sensor failures, and demonstrates through MuJoCo experiments that Transformer-PPO significantly outperforms MLP, RNN, and SSM baselines under severe sensor dropout.
WIMLE: Uncertainty-Aware World Models with IMLE for Sample-Efficient Continuous Control: WIMLE extends Implicit Maximum Likelihood Estimation (IMLE) to model-based RL, learning stochastic world models capable of capturing multimodal transition dynamics. Predictive uncertainty is estimated via ensemble and latent sampling, and is used to weight the RL objective on synthetic data. Across 40 continuous control tasks, WIMLE achieves superior sample efficiency and asymptotic performance compared to strong model-free and model-based baselines.
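
As a rough illustration of the pseudo-label selection described in the Self-Harmony entry above, the following Python sketch scores each candidate answer by the harmonic mean of its frequency among the Solver's samples and among the samples drawn on the Reframer's paraphrase. Treating the "answer scores" as empirical frequencies is an assumption made here, and the names `harmonic_mean` and `select_pseudo_label` are hypothetical rather than taken from the paper.

```python
from collections import Counter


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative scores; 0 if either score is 0."""
    return 0.0 if a == 0 or b == 0 else 2 * a * b / (a + b)


def select_pseudo_label(solver_answers, reframer_answers):
    """Return (label, scores), where label maximizes the harmonic mean of
    an answer's frequency under the original and the reframed question.

    Both answer lists are assumed to be non-empty.
    """
    solver_freq = Counter(solver_answers)
    reframed_freq = Counter(reframer_answers)
    n_solver, n_reframed = len(solver_answers), len(reframer_answers)

    candidates = set(solver_freq) | set(reframed_freq)
    scores = {
        ans: harmonic_mean(solver_freq[ans] / n_solver,
                           reframed_freq[ans] / n_reframed)
        for ans in candidates
    }
    # Sort candidates first so that ties are broken deterministically.
    label = max(sorted(candidates), key=scores.get)
    return label, scores


# Toy example: "12" wins a majority vote on the original question alone,
# but "7" is the answer that is well supported under both views.
solver_answers = ["12", "12", "12", "7", "7"]
reframer_answers = ["7", "7", "7", "7", "12"]

majority_label = Counter(solver_answers).most_common(1)[0][0]
label, scores = select_pseudo_label(solver_answers, reframer_answers)
print(majority_label, label)
```

Because the harmonic mean is dominated by the smaller of its two inputs, an answer that is frequent under only one view scores low, which is one plausible reading of why this criterion is used in place of majority voting.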

🧩 Multimodal VLM¶

A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models: This paper proposes the A-TPT framework, which promotes angular diversity by maximizing the minimum pairwise angular distance among normalized text features on the unit hypersphere. It addresses the miscalibration caused by overconfident predictions in test-time prompt tuning (TPT) of VLMs, achieving superior performance over existing TPT calibration methods on both natural distribution shifts and medical datasets.
BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning: This paper proposes BEAT, the first visual backdoor attack framework targeting VLM-driven embodied agents. It employs environmental objects (e.g., knives) as triggers and adopts a two-stage training pipeline (SFT + Contrastive Trigger Learning) to achieve precise backdoor activation. BEAT attains an attack success rate of up to 80% while preserving normal task performance, exposing critical security vulnerabilities in VLM-based embodied agents.
BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models: This paper proposes BioCAP, which trains a biological multimodal foundation model by using an MLLM to generate Wikipedia-knowledge-guided synthetic descriptive captions (rather than relying solely on species labels). BioCAP achieves an average improvement of 8.8% over BioCLIP across 10 species classification benchmarks and a 21.3% gain on text-image retrieval tasks.
Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems: Bongard-RWR+ is a benchmark comprising 5,400 Bongard problems, constructed via a VLM-based pipeline (Pixtral-12B + Flux.1-dev) that automatically generates photorealistic images to represent abstract concepts. Systematic evaluation reveals that state-of-the-art VLMs struggle to discriminate fine-grained visual concepts such as contour, rotation, and angle, with accuracy dropping as low as 19%.
Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting (WS-COC): This paper proposes WS-COC, the first MLLM-based weakly supervised class-agnostic object counting framework. Through three strategies — divide-and-discern dialogue tuning (progressively narrowing the counting range), comparative ranking optimization (learning relative counting relationships across images), and global-local counting enhancement — WS-COC achieves performance comparable to or surpassing fully supervised methods using only image-level count annotations.
Breaking the Limits of Open-Weight CLIP: An Optimization Framework for Self-supervised Fine-tuning of CLIP: This paper proposes TuneCLIP, a self-supervised fine-tuning (SSFT) framework that improves existing open-weight CLIP models through a two-stage design — first recovering optimizer statistics (OSR) to eliminate cold-start bias, then applying a hinged global contrastive loss (HGCL) with a margin to mitigate over-penalization of false negatives — achieving consistent general-purpose performance gains without any labels, with improvements of up to +2.5% on ImageNet and variants and +1.2% on the DataComp benchmark.
Can Vision-Language Models Answer Face to Face Questions in the Real-World?: This paper introduces QIVD (Qualcomm Interactive Video Dataset), a face-to-face real-time QA benchmark comprising 2,900 videos with audio and timestamp annotations. It reveals that existing VLMs fall far short of human performance in real-time situated understanding (best model 60% vs. humans 87%), with primary bottlenecks in referential disambiguation, response timing judgment, and situated commonsense. Fine-tuning on this data can substantially close the gap.
Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts: To address the Straggler Effect in MoE inference—where the most heavily loaded expert determines overall latency due to uneven token distribution—this paper proposes Capacity-Aware Token Drop (discarding low-scoring tokens from overloaded experts) and Expanded Drop (re-routing overflow tokens to lightly loaded local experts). The approach achieves a 1.85× speedup on Mixtral-8×7B with a 0.2% performance improvement.
CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing: CityLens is introduced as the largest urban socioeconomic sensing benchmark to date (17 cities, 6 domains, 11 prediction tasks), evaluating 17 LVLMs across three paradigms—direct metric prediction, normalized metric estimation, and feature-based regression—for inferring socioeconomic indicators from satellite and street-view imagery. Results show that general-purpose LVLMs still fall short of domain-specialized contrastive learning methods on most tasks.
Closing the Modality Gap Aligns Group-Wise Semantics: This paper demonstrates that the modality gap in CLIP is inconsequential for instance-level tasks (retrieval) yet severely harms group-level tasks (clustering). It proposes a novel objective comprising an Align True Pairs loss and a Centroid Uniformity loss that reduces the gap to nearly zero in both bimodal and trimodal settings, substantially improving clustering V-Measure by +10–17 points while preserving retrieval performance.
Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping: This paper proposes AttWarp, a plug-and-play test-time image warping method that leverages the MLLM's own cross-modal attention maps to perform rectilinear grid resampling — expanding high-attention regions and compressing low-attention regions — achieving consistent accuracy improvements, enhanced compositional reasoning, and reduced hallucinations across 5 benchmarks and 4 MLLMs.
Contamination Detection for VLMs using Multi-Modal Semantic Perturbation: This paper proposes a multi-modal semantic perturbation framework for detecting data contamination in VLMs. It uses an LLM to generate dense captions and Flux ControlNet to alter answer-relevant semantic elements while preserving image composition. Contaminated models suffer sharp performance drops on perturbed samples due to memorization of original image-text pairs, whereas clean models are unaffected thanks to genuine reasoning ability. The paper also provides the first systematic validation that most existing LLM-based contamination detection methods are unreliable in VLM settings.
Context Tokens are Anchors: Understanding the Repetition Curse in dMLLMs from an Information Flow Perspective: This work investigates the underlying mechanism behind the "Repetition Curse" in diffusion multimodal large language models (dMLLMs) when cache-based acceleration is applied, through an information flow perspective. It reveals that context tokens act as anchors that aggregate semantic information, and that caching disrupts this information flow pattern. The proposed CoTA method reduces repetition rates by up to 92%.
Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach: This paper proposes the Emotion Statement Judgment (ESJ) task and the INSETS automatic annotation pipeline, reformulating visual emotion evaluation from "open-ended classification" to "statement veracity judgment." The authors construct the MVEI benchmark (3,086 samples, 424 emotion labels, four cognitive dimensions) and systematically evaluate 19 MLLMs, finding that even GPT-4o lags behind humans (91.6%) by 13.3% in accuracy.
Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification: This paper proposes EUQ (Evidential Uncertainty Quantification), which leverages Dempster-Shafer evidence theory to decompose the epistemic uncertainty of LVLMs into conflict CF (internal contradictions) and ignorance IG (lack of information). EUQ requires no training and only a single forward pass to detect four types of misbehaviors—hallucination, jailbreak, adversarial attacks, and OOD failures—achieving an average AUROC improvement of 10.4%/7.5% over the best baseline.
Directional Embedding Smoothing for Robust Vision Language Models: This paper extends RESTA (Randomized Embedding Smoothing and Token Aggregation) from LLMs to VLMs, demonstrating that directional embedding noise significantly outperforms isotropic noise in the safety-utility tradeoff, serving as a lightweight inference-time defense layer against multimodal jailbreak attacks.
DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage: This paper proposes DIVA-GRPO, which addresses reward sparsity and advantage vanishing in GRPO training by dynamically assessing question difficulty, adaptively generating semantically consistent variants of varying difficulty, and incorporating difficulty-weighted local-global advantage estimation. The method achieves state-of-the-art multimodal reasoning performance at the 7B model scale.
Do Vision-Language Models Respect Contextual Integrity in Location Disclosure?: This paper introduces the VLM-GEOPRIVACY benchmark grounded in Nissenbaum's Contextual Integrity (CI) theory. Through seven progressively structured context-aware questions and a three-tier location disclosure granularity (refusal / city-level / precise location), it systematically evaluates whether 14 mainstream VLMs can determine appropriate location disclosure levels based on social-norm cues present in images. Results show that all models exhibit severe over-disclosure bias (Over-Disclosure rates of 46–52%), and malicious prompting can push the Abstention Violation rate to 100%.
Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models: This paper proposes Dynamic Multimodal Activation Steering (DMAS), a training-free method that constructs a semantics-based truthfulness steering vector database and a visual perception steering vector, dynamically selecting the most relevant steering vectors at inference time to intervene on critical attention heads. DMAS significantly mitigates hallucinations in LVLMs, achieving a gain of 94.66 points on MME and a 20.2% reduction in hallucination rate on CHAIR.
EgoHandICL: Egocentric 3D Hand Reconstruction with In-Context Learning: This work introduces the in-context learning (ICL) paradigm to 3D hand reconstruction for the first time. Through VLM-guided template retrieval, a multimodal ICL tokenizer, and an MAE-driven reconstruction pipeline, EgoHandICL significantly outperforms state-of-the-art methods on the ARCTIC and EgoExo4D benchmarks.
Empowering Small VLMs to Think with Dynamic Memorization and Exploration: This paper proposes DyME (Dynamic Memorize-Explore), which progressively and dynamically alternates between an SFT memorization mode and a GRPO exploration mode, enabling—for the first time—reasoning capabilities in small-scale vision-language models (SVLMs, <1B parameters) on domain-specific tasks.
Enhanced Continual Learning of Vision-Language Models with Model Fusion: This paper proposes the Continual Decoupling-Unifying (ConDU) framework, which is the first to introduce model fusion into VLM continual learning. By maintaining a unified model and performing iterative decoupling-unifying operations guided by task triggers, ConDU surpasses the state of the art by an average of 2% on the MTIL benchmark while simultaneously enhancing zero-shot capability.
Enhancing Multi-Image Understanding through Delimiter Token Scaling: By scaling the hidden states of image delimiter tokens in vision-language models, this work enhances inter-image information isolation and achieves performance gains on multi-image understanding benchmarks (Mantis/MuirBench/MIRB/QBench2) and multi-document/multi-table understanding benchmarks (TQABench/MultiNews/WCEP-10) without introducing any additional training or inference cost.
Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models: This paper proposes a training-free two-stage VLM framework that records corrected reasoning trajectories in an Error Notebook and applies RAG-based test-time adaptation. On specification-driven part retrieval in 3D CAD assemblies, GPT-4o accuracy improves from 41.7% to 65.1% (+23.4%), with a further +4.5% gain from a grammar-constrained validator.
Evaluating VLMs' Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences: This paper systematically evaluates VLMs' spatial reasoning capabilities over robot motion trajectories, proposing four image-querying methods that enable VLMs to select optimal motion paths based on user natural language descriptions. Results show that Qwen2.5-VL achieves 71.4% zero-shot accuracy, with smaller models achieving significant gains after fine-tuning.
FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models: This paper introduces FRIEDA, a benchmark that systematically evaluates large vision-language models (LVLMs) on multi-step, cross-map cartographic reasoning. The strongest model, Gemini-2.5-Pro, achieves only 38.20% accuracy, far below the human baseline of 84.87%.
GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-Guided Latent Diffusion Model?: This paper proposes GLYPH-SR, a vision-language-guided diffusion framework that simultaneously optimizes image quality and text readability via a dual-branch Text-SR fusion ControlNet and a ping-pong scheduler, achieving a 15.18-point improvement in OCR F1 on SVT ×8.
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs: This paper proposes GAR (Grasp Any Region), which employs RoI-aligned feature replay to extract high-fidelity local features while preserving global context, enabling precise single-region captioning, multi-region interaction modeling, and compositional reasoning. The 1B model surpasses InternVL3-78B.
Grounding-IQA: Grounding Multimodal Language Models for Image Quality Assessment: This paper integrates spatial grounding (referring + grounding) with image quality assessment (IQA) and constructs the GIQA-160K dataset to fine-tune a multimodal LLM that generates quality descriptions with bounding boxes and performs spatial VQA, achieving significantly better fine-grained quality perception than general-purpose MLLMs.
GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models: This paper introduces GTR-Bench, a novel benchmark for geo-temporal reasoning of moving targets in large-scale camera networks. Evaluation reveals that the strongest model, Gemini-2.5-Pro (34.9%), falls far short of human performance (78.61%), exposing three critical deficiencies in current VLMs: imbalanced utilization of spatial-temporal context, weak temporal prediction capability, and insufficient map-video alignment.
HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit: This paper proposes the HiDrop framework, which conducts a systematic layer-wise behavioral analysis of MLLMs (shallow layers = propagators, middle layers = fusion hubs, deep layers = language reasoners) and designs a three-stage strategy: Late Injection (skipping shallow layers) + Concave Pyramid Pruning (aggressive pruning in middle layers) + Early Exit (discarding tokens in deep layers). The framework compresses approximately 90% of visual tokens with negligible performance degradation and achieves a 1.72× training speedup.
How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images: This work presents the first systematic diagnosis revealing that the root cause of poor zero-shot medical VQA performance in medical MLLMs is insufficient visual grounding—model attention systematically deviates from clinically relevant regions. Building on this finding, the authors propose VGRefine, a training-free inference-time attention correction method that achieves state-of-the-art results across 110K+ samples on 6 benchmarks spanning 8 imaging modalities.
ICYM2I: The Illusion of Multimodal Informativeness under Missingness: This paper identifies a largely overlooked problem in multimodal learning: distribution shift induced by modality missingness leads to severely biased modality value estimation. The proposed ICYM2I framework applies dual inverse probability weighting (IPW) to correct bias in both training and evaluation, achieving unbiased estimates of modality predictive utility and information-theoretic value under the missing-at-random (MAR) assumption.
Index-Preserving Lightweight Token Pruning for Efficient Document Understanding: A binary patch classifier with only 203K parameters is inserted before the VLM visual encoder to remove background tokens from document images. A $3 \times 3$ max-pooling operation is then applied to recover fragmented text regions while preserving original spatial indices, achieving 40–60% FLOPs reduction on Qwen2.5-VL with accuracy degradation of no more than ~5 percentage points.
IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning: This paper reveals the implicit visual coordinate (IVC) system established by RoPE positional encoding within LVLMs, and proposes a training-free, prompt-aware vision token pruning strategy that preserves IVC tokens and semantic foreground tokens while pruning approximately 50% of visual tokens with ≥99% of original performance retained.
K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge: This paper proposes the K-Sort Eval framework, which leverages posterior correction and dynamic matching strategies to enable VLMs to reliably and efficiently replace human annotators in preference evaluation of visual generation models, typically converging to results consistent with human Arena rankings in fewer than 90 model runs.
KeepLoRA: Continual Learning with Residual Gradient Adaptation: By analyzing the singular value decomposition (SVD) of pretrained model weights, this paper finds that general knowledge is encoded in the principal subspace while domain-specific knowledge resides in the residual subspace. KeepLoRA constrains LoRA updates for new tasks to the residual subspace and uses gradient information for initialization to preserve plasticity, balancing forward stability, backward stability, and plasticity in continual learning.
Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification: This paper identifies a pervasive "agreement bias" in multimodal large language models (MLLMs) when used as agent behavior verifiers—whereby models systematically over-approve agent actions—and proposes Self-Grounded Verification (SGV), a two-step generation framework (first extracting behavioral priors, then performing conditioned verification) to mitigate this bias. SGV achieves up to 25 pp improvement in failure detection rate and 14 pp improvement in accuracy across web navigation, desktop manipulation, and robotic manipulation tasks.
LiveWeb-IE: A Benchmark For Online Web Information Extraction: This paper introduces LiveWeb-IE, the first benchmark for online web information extraction (WIE), covering multi-type data extraction including text, images, and hyperlinks. It further proposes the Visual Grounding Scraper (VGS) framework, which simulates human cognitive processes—visual scanning to locate regions → precise element localization → XPath generation—to achieve robust information extraction on dynamic webpages.
LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models: This paper proposes LLaVA-FA, an efficient compression method for large multimodal models (LMMs) that performs joint low-rank and quantization weight approximation in the frequency domain. By exploiting the decorrelation property and conjugate symmetry of the Fourier transform, the method achieves more compact and accurate weight representations. It further introduces PolarQuant (polar coordinate quantization) and ODC (Optional Diagonal Calibration), surpassing existing efficient multimodal models on multiple benchmarks with minimal active parameters and computational cost.
Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation: This paper proposes AIR (Adaptive vIsual Reinforcement), a framework that reduces hallucinations in MLLMs at inference time without any training, via prototype-distance-based token reduction combined with optimal-transport-guided selective patch reinforcement (LLaVA-1.5-7B CHAIR_S: 22→18.4, POPE accuracy +5.3%), while preserving general multimodal capabilities.
Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering: This paper proposes MAPD (Meta-Adaptive Prompt Distillation), a MAML-based prompt distillation framework that leverages an attention mapper to distill soft prompts from task-relevant image features, enabling LMMs to adapt to novel visual question answering tasks at test time with only a few gradient steps. MAPD outperforms ICL by 21.2%.
Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models: This paper identifies modality-specific and attention-head-specific semantic redundancy in the KV Cache of LVLMs, demonstrating that importance-only selection fails to preserve semantic coverage. The proposed MixKV adaptively mixes importance and diversity scores per attention head for KV Cache compression, achieving an average improvement of 5.1% under extreme compression ratios.
MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning: This paper introduces MMR-Life, a benchmark comprising 2,646 five-choice multi-image questions based on 19,108 real-life images, covering 7 reasoning types and 21 tasks. It is the first systematic evaluation of MLLMs on multi-image reasoning in real-life scenarios. The strongest model, GPT-5, achieves only 58.69% accuracy—14 percentage points below human performance. Key findings include the failure of reasoning enhancement methods on large models and the weaker generalization of RL compared to BoN.
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs: This paper proposes MMTok, a multimodal visual token selection framework formulated as a Maximum Coverage Problem. By jointly leveraging text-visual and visual-visual coverage signals, MMTok selects the most informative subset of visual tokens in a training-free manner, significantly outperforming unimodal baselines and even surpassing methods that require fine-tuning.
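A coverage objective of this kind is commonly optimized greedily. The sketch below is a generic greedy routine for a soft maximum-coverage objective; whether MMTok uses exactly this objective or this solver is an assumption, and `sim`, `greedy_coverage`, and the toy scores are hypothetical.

```python
def greedy_coverage(sim, k):
    """Pick k candidates that maximize sum_j max_{i in chosen} sim[i][j].

    sim[i][j] is the similarity of candidate token i to target j
    (a target could be a text token or another visual token).
    Assumes k <= len(sim) and non-negative similarities.
    """
    covered = [0.0] * len(sim[0])
    chosen = []
    for _ in range(k):
        best_i, best_gain = None, -1.0
        for i, row in enumerate(sim):
            if i in chosen:
                continue
            # Marginal gain: how much candidate i raises the current coverage.
            gain = sum(max(s - c, 0.0) for s, c in zip(row, covered))
            if gain > best_gain:
                best_i, best_gain = i, gain
        chosen.append(best_i)
        covered = [max(c, s) for c, s in zip(covered, sim[best_i])]
    return chosen, sum(covered)


# Candidates 0 and 1 are individually strong but redundant; candidate 2 covers a new target.
sim = [
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.0],
    [0.0, 0.1, 0.7],
]
chosen, coverage = greedy_coverage(sim, k=2)
print(chosen, coverage)
```

Ranking candidates by their individual scores would keep the two redundant tokens, whereas the coverage objective prefers the pair that jointly covers all three targets.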
Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory?: This paper identifies and systematically defines the phenomenon of Modal Aphasia — unified multimodal models can generate visual concepts (e.g., movie poster images) from memory with near-perfect fidelity, yet exhibit error rates more than 7× higher when verbally describing the same concepts, with severe hallucinations occurring almost exclusively in the text modality. Through real-world experiments with frontier models (ChatGPT-5) and controlled synthetic experiments with open-source models (Janus-Pro, Harmon), the paper confirms that modal aphasia is a systemic deficiency of current unified architectures rather than a training artifact, and demonstrates its potential threat to AI safety frameworks.
Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional: A large-scale empirical study reveals severe unimodal dependency issues across 23 VQA benchmarks — many benchmarks designed to eliminate text bias have instead introduced image bias, with models exploiting unimodal shortcuts rather than performing genuine cross-modal reasoning.
Multimodal Classification via Total Correlation Maximization: This paper analyzes modality competition in multimodal classification from an information-theoretic perspective and proposes TCMax, a loss function that maximizes the Total Correlation (TC) between multimodal features and labels. TCMax simultaneously addresses joint learning, unimodal learning, and cross-modal alignment without additional hyperparameters, surpassing state-of-the-art methods on multiple audio-visual and image-text classification benchmarks.
Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs: This work is the first to extend automatic prompt optimization (APO) from the pure text space to the multimodal space, proposing the MPO framework. It achieves an average accuracy of 65.1% across 10 datasets spanning image, video, and molecular modalities—surpassing the strongest text-based APO baseline ProTeGi (60.0%)—via two key components: alignment-preserving joint exploration (unified semantic gradients synchronously drive text and non-text prompt updates, diversified by Generation/Edit/Mix operators) and prior-inherited Bayesian UCB candidate selection (warm-starting child prompt Beta priors from parent prompt performance).
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models: Grounded in cognitive psychology, this work introduces OmniSpatial—the first comprehensive spatial reasoning benchmark—systematically covering 4 dimensions (dynamic reasoning, complex spatial logic, spatial interaction, and perspective transformation) across 50 subcategories with 8.4K manually annotated QA pairs. The strongest reasoning model, o3, achieves only 56.33% while humans reach 92.63%, revealing that complex spatial reasoning remains a fundamental bottleneck for VLMs.
On the Generalization Capacities of MLLMs for Spatial Intelligence: This paper identifies a fundamental flaw in RGB-only spatial reasoning MLLMs—the focal-length–depth ambiguity arising from the neglect of camera intrinsics—and proposes the Camera-Aware MLLM (CA-MLLM) framework. Through dense camera ray embedding, camera-aware data augmentation, and geometric prior distillation, it improves F1 from 39.1% to 52.1% on cross-camera generalization benchmarks for spatial localization.
Post-hoc Probabilistic Vision-Language Models: This paper proposes a training-free, post-hoc uncertainty estimation method that applies a Laplace approximation to the last few layers of VLMs such as CLIP and SigLIP and analytically derives the uncertainty over cosine similarities. The method achieves substantial improvements over baselines in both uncertainty quantification and active learning.
PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models: This paper proposes PPE (Positional Preservation Embedding), which exploits the dimensional independence of rotations in RoPE to encode multiple original position IDs from merged tokens into distinct dimension segments, enabling a single compressed token to carry multiple spatial/temporal positional cues. PPE is a zero-parameter, plug-and-play operator that achieves an average performance drop of only 3.6% on image tasks at 55% compression, and maintains comparable performance at 90% compression via cascaded compression.
PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies: This work introduces PRISMM-Bench, the first benchmark grounded in genuine reviewer-annotated multimodal inconsistencies in scientific papers. Mining 18,009 ICLR open reviews yields 384 cross-modal inconsistencies, evaluated across three tasks—identification, remediation, and paired matching—with a JSON-structured debiasing scheme for answer representation. Among 21 state-of-the-art LMMs, the best achieves only 53.9%, systematically exposing severe deficiencies in cross-modal reasoning over scientific documents.
Procedural Mistake Detection via Action Effect Modeling: This paper proposes a dual-branch multimodal supervision framework for action effect modeling, combining a visual branch (object state and spatial relation features) with a text branch (GPT-4o-generated scene graphs). Learnable effect tokens distill external supervision signals, achieving state-of-the-art mistake detection on egocentric procedural videos.
Reasoning-Driven Multimodal LLM for Domain Generalization: This paper proposes RD-MLDG — the first framework to incorporate MLLM reasoning chains into domain generalization. It constructs the DomainBed-Reasoning dataset, systematically analyzes two core challenges of reasoning supervision (optimization gap + reasoning pattern mismatch), and addresses them jointly via MTCT (Multi-Task Cross-Training) and SARR (Self-Aligned Reasoning Regularization), achieving an average accuracy of 86.89% across four standard DG benchmarks — substantially surpassing GPT-4o (83.46%) and all CLIP/ViT-based methods.
Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks: This paper introduces the Ref-Adv benchmark, constructed via a pipeline of hard distractor pairing + LLM-assisted minimally sufficient expression generation + three-annotator unanimous verification. The benchmark eliminates "grounding shortcuts" present in classical REC datasets. Across 13 contemporary MLLMs — including GPT-4o, Gemini 2.5, and Qwen2.5-VL-72B — accuracy drops dramatically from 90%+ on RefCOCO(+/g) to 50–68% on Ref-Adv, systematically exposing severe deficiencies in complex visual reasoning and precise grounding.
Revisit Visual Prompt Tuning: The Expressiveness of Prompt Experts: This paper reveals the limitations of VPT from a Mixture-of-Experts (MoE) perspective — prompt experts are input-agnostic constant functions with limited expressiveness — and proposes VAPT, which employs token-wise projectors and a shared feature projector to make prompt experts input-adaptive. VAPT achieves superior performance with fewer parameters and is supported by theoretical guarantees on optimal sample efficiency.
Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes: This paper proposes MV-RoboBench, the first benchmark integrating multi-view spatial reasoning with robotic manipulation tasks, systematically evaluating 40+ VLMs (open-source, closed-source, and reasoning-enhanced). The best-performing model, GPT-5, achieves only 56.4% accuracy, far below the human baseline of 91.0%. The study further reveals a positive correlation between spatial and robotic reasoning, and that performance on single-view benchmarks does not reliably transfer to multi-view settings.
Self-Aug: Query and Entropy Adaptive Decoding for Large Vision-Language Models: This paper proposes Self-Aug, a training-free decoding strategy that employs Self-Augmentation Selection (SAS) Prompting to enable LVLMs to leverage their own parametric knowledge for dynamically selecting query-semantically-aligned visual augmentations. It further introduces the Sparsity Adaptive Truncation (SAT) algorithm, which exploits the full entropy of the output distribution to dynamically regulate candidate token set size. Self-Aug consistently outperforms existing contrastive decoding methods across 5 LVLMs and 7 benchmarks.
Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking: This paper proposes EvoQuality, a self-supervised iterative framework that generates pseudo-ranking labels via pairwise majority voting and employs GRPO for self-iterative optimization, enabling VLMs to autonomously improve their image quality perception without any human annotations. The framework achieves a 31.8% PLCC improvement in zero-shot settings and surpasses supervised SOTA on 5 out of 7 IQA benchmarks.
Shuffle-R1: Efficient RL Framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle: This paper proposes Shuffle-R1, an RL training framework that addresses two key efficiency bottlenecks, Advantage Collapsing and Rollout Silencing, through Pairwise Trajectory Sampling (selecting high-contrast trajectory pairs) and Advantage-based Batch Shuffle (redistributing training batches by advantage values). The framework achieves a 22% improvement over the baseline on Geo3K and surpasses GPT-4o on MathVerse.
Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation: Inspired by the draft-then-verify paradigm of Speculative Decoding, this paper proposes Speculative Verdict (SV), which employs multiple lightweight VLMs to generate diverse reasoning paths as drafts, while a large model serves as the verdict to synthesize, verify, and correct them. Without any training, SV surpasses GPT-4o by 11.9% on information-intensive VQA and recovers correct answers in 47–53% of minority-correct cases.
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward: This paper proposes SophiaVL-R1, which introduces a holistic-level thinking process reward into rule-based RL training of MLLMs. A Thinking Reward Model (TRM) is trained to evaluate reasoning quality along five dimensions (including logical soundness and redundancy). Trust-GRPO is proposed to compute a reliability weight $\gamma$ from the contrast of thinking rewards between correct and incorrect answer groups, mitigating reward hacking. A time-based annealing strategy $e^{-\text{steps}/T}$ gradually reduces the thinking reward contribution so that the model relies more on accurate rule-based rewards in later training. The resulting 7B model comprehensively outperforms LLaVA-OneVision-72B on multiple benchmarks, including MathVista (71.3%) and MMMU (61.3%).
Sparsity Forcing: Reinforcing Token Sparsity of MLLMs: This paper proposes Sparsity Forcing — a GRPO-based RL post-training framework that treats a sparse-attention MLLM as the policy model and the original MLLM as the reference model. Through multi-budget rollouts exploring different token retention thresholds $p$, and using a joint reward combining efficiency (token reduction rate) and performance (answer correctness) for within-group contrastive optimization, the method improves the token reduction rate of Qwen2/2.5-VL from 20% to 75% with minimal accuracy loss, achieving 3× memory reduction and 3.3× decoding speedup.
Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models: This paper proposes Spatial-DISE, a unified spatial reasoning benchmark grounded in a cognitive-science-based 2×2 taxonomy (Intrinsic/Extrinsic × Static/Dynamic). The benchmark comprises 559 evaluation VQA pairs and 12K+ training instances. Evaluation across 32 state-of-the-art VLMs reveals a substantial gap between model performance and human-level capability, particularly on dynamic spatial reasoning tasks such as mental rotation and folding.
Spatial CAPTCHA: Generatively Benchmarking Spatial Reasoning for Human-Machine Differentiation: This paper proposes Spatial CAPTCHA, a novel human verification framework grounded in 3D spatial reasoning. It exploits fundamental capability gaps between humans and multimodal large language models (MLLMs) across geometric reasoning, perspective-taking, occlusion handling, and mental rotation tasks to distinguish humans from machines. The best-performing MLLM achieves only 31.0% Pass@1 accuracy, far below human performance.
Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA: Through controlled experiments within the LLaVA framework, this paper systematically investigates the effects of image encoder training objectives and 2D positional encoding on the spatial reasoning capabilities of VLMs. The study finds that encoder choice dominates spatial performance and that AIMv2 yields the most consistent results, whereas improvements from 2D-RoPE are unstable, indicating that spatial reasoning failures are rooted in core design choices of current VLM pipelines.
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?: This paper introduces SpatiaLab, a real-world spatial reasoning benchmark comprising 1,400 visual QA pairs spanning 30 subcategories across 6 major spatial task categories. Supporting both MCQ and open-ended evaluation formats, SpatiaLab reveals a substantial gap between the strongest current VLMs (InternVL3.5-72B: 54.93% MCQ) and humans (87.57%), with the gap widening further under open-ended settings.
SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery: SpectralGCD represents images as CLIP cross-modal image-text similarity vectors (i.e., mixtures of semantic concepts), employs spectral filtering to automatically select task-relevant concepts, and applies forward-backward knowledge distillation to preserve semantic quality. The method achieves a new multimodal GCD state of the art across six benchmarks at a training cost comparable to unimodal approaches.
SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs: This paper introduces SpinBench, a cognitively grounded diagnostic benchmark that systematically evaluates spatial reasoning in 37 VLMs through 7 progressively structured task categories—ranging from object identity recognition to perspective taking—revealing systemic deficiencies including egocentric bias and weak rotation understanding.
Steering and Rectifying Latent Representation Manifolds in Frozen Multi-Modal LLMs for Video Anomaly Detection: This paper proposes SteerVAD, a framework that identifies "latent anomaly expert" (LAE) attention heads within a completely frozen multimodal large language model (MLLM) and dynamically steers their representation manifolds via a hierarchical meta-controller, achieving tuning-free video anomaly detection SOTA with only 1% of training data.
TableDART: Dynamic Adaptive Multi-Modal Routing for Table Understanding: This paper proposes TableDART, which employs a lightweight MLP gating network with only 2.59M parameters to dynamically select the optimal processing path (Text-only / Image-only / Fusion) for each query-table pair. By reusing frozen unimodal expert models and introducing an LLM Agent for cross-modal fusion, TableDART achieves an average improvement of 4.02% over the strongest MLLM baseline HIPPO across 7 table understanding benchmarks, while reducing inference latency by 24.5%.
ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding: ThinkOmni is a training-free framework that leverages a text-only large reasoning model (LRM) to guide an omni-modal LLM (OLLM) during decoding via Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals. The method achieves 70.2% on MathVista and 75.5% on MMAU, matching or surpassing RFT-based approaches.
Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs: This paper proposes VC-STaR (Visual Contrastive Self-Taught Reasoner), motivated by the observation that VLMs perceive visual content more accurately when comparing two similar images. A contrastive self-improvement framework is designed: contrastive VQA pairs are constructed to elicit more faithful visual analysis from the model, and an LLM integrates this contrastive analysis into reasoning chains, yielding the high-quality visual reasoning dataset VisCoR-55K. Fine-tuning on this dataset achieves +5.7% on MMVP and +3.2% on Hallusion.
U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning: This work systematically ablates the design space of MLLM embedding learning, revealing key factors such as bidirectional attention + mean pooling outperforming the mainstream last-token approach, and learnable temperature being severely underestimated. Based on these findings, the authors construct U-MARVEL, a three-stage framework (progressive transition → filtered hard negatives → reranker distillation), achieving 63.2% Avg on M-BEIR with a single model, substantially surpassing existing SOTA, while also leading on zero-shot CIR and T2V transfer.
Unified Vision-Language Modeling via Concept Space Alignment: This paper proposes v-Sonar, which post-hoc aligns a visual encoder to the SONAR text embedding space, enabling the Large Concept Model (LCM) trained in the SONAR space to handle visual inputs in a zero-shot manner. Through instruction fine-tuning, v-Sonar is extended into v-LCM, which surpasses existing VLMs in 61 out of 62 languages.
UniHM: Unified Dexterous Hand Manipulation with Vision Language Model: This paper proposes UniHM, the first unified language-conditioned dexterous hand manipulation framework. It maps heterogeneous robotic hands into a shared discrete space via a morphology-agnostic VQ codebook, leverages a VLM for instruction-driven manipulation sequence generation, and ensures physical feasibility through physics-guided dynamic refinement.
VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL: VidGuard-R1 is the first video authenticity detector that fine-tunes an MLLM with GRPO (Group Relative Policy Optimization). By constructing a 140K shortcut-free real/fake video dataset and designing two specialized reward mechanisms—temporal artifact reward and diffusion-step quality reward—it achieves 86.17% accuracy on its in-house dataset and 95%+ zero-shot SOTA performance on GenVidBench and GenVideo benchmarks, while generating interpretable chain-of-thought reasoning.
VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs: This paper introduces VisioMath, a benchmark comprising 1,800 K-12 mathematics problems in which all answer choices consist of highly visually similar figures. It reveals a core weakness of LMMs in multi-image–text alignment, and explores three alignment strategies that achieve up to +12.6% accuracy improvement.
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models: This paper proposes Vision-R1, which constructs a 200K-sample high-quality multimodal CoT dataset via Modality Bridging for cold-start initialization, then applies Progressive Thinking Suppression Training (PTST) combined with GRPO reinforcement learning. At the 7B parameter scale, Vision-R1 achieves multimodal mathematical reasoning performance approaching that of OpenAI O1.
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play: This paper proposes Vision-Zero, the first annotation-free gamified self-play framework for VLMs. By casting visual reasoning as a "Who is the Spy?"-style game and combining it with the Iterative-SPO training algorithm, Vision-Zero achieves scalable self-improvement and surpasses SOTA methods trained on human-annotated data across reasoning, chart understanding, and vision-centric tasks.
VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations: This paper introduces VisJudge-Bench, the first comprehensive benchmark for aesthetics and quality assessment of data visualizations (3,090 samples, 32 chart types), and trains the VisJudge model, which reduces MAE by 23.9% compared to GPT-5 and improves agreement with human experts by 60.5%.
Visual Prompt-Agnostic Evolution: This paper proposes Prompt-Agnostic Evolution (PAE), which accelerates VPT convergence (average 1.41×) and improves accuracy by 1–3% across 25 datasets through frequency-aware task initialization (MPA) and a Koopman-Lyapunov dynamical system (KLD) for cross-layer prompt coupling. PAE is plug-and-play for various VPT variants and introduces no inference overhead.
Visual Symbolic Mechanisms: Emergent Symbol Processing in Vision Language Models: This paper discovers that VLMs internally develop a three-stage symbolic processing mechanism (ID retrieval → ID selection → feature retrieval) that uses content-agnostic spatial position indices (position IDs) to solve the visual binding problem, and demonstrates that binding errors can be directly traced to failures in these mechanisms.
VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?: This paper introduces VLM-SubtleBench, a benchmark for evaluating vision-language models on subtle difference comparative reasoning, covering 10 difference types and 6 image domains (natural, gaming, industrial, aerial, medical, and synthetic). It reveals a performance gap of over 30% between VLMs and humans on spatial, temporal, and viewpoint reasoning tasks.
VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use: This paper proposes VTool-R1, the first framework that trains VLMs via reinforcement fine-tuning to generate interleaved textual and visual intermediate reasoning steps, enabling models to "think with images."
WebDS: An End-to-End Benchmark for Web-based Data Science: This paper introduces WebDS, the first end-to-end web-based data science benchmark comprising 870 tasks across 29 websites and 10 domains. The strongest evaluated agent (BrowserUse + GPT-4o) completes only 15% of tasks, while humans achieve 90%, revealing a substantial performance gap in realistic data science workflows.
Why Keep Your Doubts to Yourself? Trading Visual Uncertainties in Multi-Agent Bandit Systems: This paper proposes Agora, a framework that recasts multi-agent VLM coordination as a decentralized uncertainty trading market. Cognitive uncertainty is minted into quantifiable, three-dimensional tradable assets (perceptual / semantic / inferential), and efficient equilibrium allocation is achieved through a profit-driven trading protocol and a market-aware Thompson Sampling Broker. Agora consistently outperforms heuristic baselines across five multimodal benchmarks (e.g., +8.5% accuracy on MMMU with more than 3× cost reduction).
Why Reinforcement Fine-Tuning Preserves Prior Knowledge Better: A Data Perspective: Through a systematic study of how SFT and RFT affect prior knowledge using a jigsaw puzzle task, this paper reveals that the key to RFT avoiding catastrophic forgetting lies in data distribution rather than algorithmic differences — data sampled by RFT naturally aligns with the base model's probability landscape, causing less interference.
Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition: This paper proposes DA-HOI, a zero-shot HOI detection framework that fully decouples object detection from interaction recognition. It replaces conventional CLIP-based features with MLLM VQA capabilities for interaction recognition. The core contributions are deterministic generation (achieving 31.50 mAP training-free), spatial-aware pooling (incorporating spatial priors and cross-attention), and one-pass deterministic matching (reducing $M$ forward passes to one). DA-HOI comprehensively surpasses the state of the art across all four zero-shot settings on HICO-DET and supports plug-and-play detector substitution after training.

📦 Model Compression

A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA: This paper derives a Fano-style accuracy upper bound for LLM single-pass reasoning on multi-hop QA using information theory, revealing a "cliff-like" accuracy collapse when task information demand exceeds model output capacity. Based on this analysis, the authors design a multi-turn reasoning framework, InfoQA, which overcomes the single-pass bottleneck via capacity-aware decomposition, dependency-explicit workflows, and iterative query compression.
A Recovery Guarantee for Sparse Neural Networks: This paper establishes the first sparse recovery guarantee for ReLU neural networks: for two-layer scalar-output networks with Gaussian random training data, an iterative hard thresholding (IHT) algorithm based on convex reformulation can exactly recover sparse network weights, with memory requirements scaling only linearly in the number of nonzero weights.
A State-Transition Framework for Efficient LLM Reasoning: This paper proposes an efficient reasoning framework that models the LLM reasoning process as a state-transition process. It uses Linear Attention to compress information from historical reasoning steps into a state matrix, reducing attention complexity from $O(C^2)$ to $O(C)$ and KV cache from $O(C)$ to $O(1)$, while preserving the full CoT sequence and maintaining reasoning capability. An additional momentum strategy mitigates the overthinking problem caused by noisy reasoning steps.
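The constant-size state can be illustrated with plain causal linear attention. The sketch below is a generic formulation with the common $\mathrm{elu}(x) + 1$ feature map, not the paper's architecture; the feature map, the normalization, and all function names are assumptions. The streaming version keeps only a $d_k \times d_v$ state matrix and a $d_k$ normalizer, and the quadratic version is included only to check that both compute the same outputs.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1, applied element-wise."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention_stream(queries, keys, values):
    """Causal linear attention with a running state (memory independent of length)."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # sum over steps of phi(k) v^T
    norm_vec = [0.0] * d_k                     # sum over steps of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        for a in range(d_k):
            norm_vec[a] += fk[a]
            for b in range(d_v):
                state[a][b] += fk[a] * v[b]
        fq = phi(q)
        norm = sum(fq[a] * norm_vec[a] for a in range(d_k))
        outputs.append(
            [sum(fq[a] * state[a][b] for a in range(d_k)) / norm for b in range(d_v)]
        )
    return outputs


def linear_attention_quadratic(queries, keys, values):
    """Same computation written as explicit attention over all previous steps."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        fq = phi(q)
        weights = [sum(a * b for a, b in zip(fq, phi(k))) for k in keys[: t + 1]]
        total = sum(weights)
        outputs.append(
            [sum(w * v[b] for w, v in zip(weights, values)) / total for b in range(d_v)]
        )
    return outputs


qs = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
ks = [[0.1, 0.4], [-0.7, 1.2], [0.9, -0.5]]
vs = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0], [2.5, 0.5, -2.0]]
streamed = linear_attention_stream(qs, ks, vs)
reference = linear_attention_quadratic(qs, ks, vs)
```

The streaming loop never stores past keys or values, which is the sense in which the cache cost stays constant as the reasoning context grows.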
A universal compression theory for lottery ticket hypothesis and neural scaling laws: This paper proves a Universal Compression Theorem: any permutation-invariant function over $d$ objects can be asymptotically compressed to $\mathrm{polylog}(d)$ objects with error approaching zero (which is the optimal compression rate). From this theorem, the authors directly derive: (1) a proof of the dynamic lottery ticket hypothesis: any network can be compressed to polylogarithmic width while preserving its learning dynamics; (2) a dataset compression result: any dataset can be compressed to polylogarithmic size while preserving the loss landscape; and (3) an acceleration of power-law scaling laws to arbitrarily fast decay rates.
ABBA-Adapters: Efficient and Expressive Fine-Tuning of Foundation Models: This paper proposes ABBA adapters, which parameterize weight updates as the Hadamard product of two independently learnable low-rank matrices, $\Delta W = s(B_1A_1) \odot (B_2A_2)$. Under the same parameter budget, ABBA achieves an effective rank of $r_1 \cdot r_2$ compared to LoRA's $r$, representing a quadratic improvement. Through Khatri-Rao reconstruction, ABBA maintains memory efficiency comparable to LoRA, and significantly outperforms existing PEFT methods on arithmetic and commonsense reasoning tasks.
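The rank claim can be checked numerically. The sketch below is a toy check, not the ABBA code: it builds a plain low-rank update and a Hadamard product of two low-rank updates with the same number of parameters, then compares their ranks. The matrix sizes, the random Gaussian factors, and the helper names are assumptions for illustration.

```python
import numpy as np

def lora_update(rng, m, n, r):
    """Plain low-rank update B @ A, with rank at most r."""
    return rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

def hadamard_update(rng, m, n, r1, r2, s=1.0):
    """Elementwise product of two low-rank updates, rank at most r1 * r2."""
    return s * lora_update(rng, m, n, r1) * lora_update(rng, m, n, r2)

rng = np.random.default_rng(0)
m, n, r1, r2 = 64, 64, 4, 4

# Both variants use (r1 + r2) * (m + n) parameters.
rank_lora = np.linalg.matrix_rank(lora_update(rng, m, n, r1 + r2))
rank_hadamard = np.linalg.matrix_rank(hadamard_update(rng, m, n, r1, r2))
print(rank_lora, rank_hadamard)
```

For generic random factors the Hadamard variant reaches the upper bound $r_1 \cdot r_2$, while the plain product is capped at $r_1 + r_2$ under the same budget.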
ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning: This paper introduces ACPBench Hard, an open-ended generative planning reasoning benchmark comprising 8 task types grounded in PDDL formal systems (1,040 questions spanning 13 domains), equipped with symbolic validators that provide rigorous correctness guarantees. A systematic evaluation of 15 LLMs reveals that even the strongest reasoning model, o1-preview, achieves accuracy ≤66% on half the tasks, and all models fail almost completely on the most fundamental task of enumerating applicable actions, exposing fundamental deficiencies in current LLMs' planning reasoning capabilities.
Adaptive Width Neural Networks: This paper proposes the AWN framework, which automatically learns the unbounded width (number of neurons) of each layer during training via variational inference. A monotonically decreasing importance function imposes a soft ordering on neurons, enabling width to adapt to task difficulty and supporting zero-cost post-training truncation for compression.
AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in LVLMs: Through systematic empirical analysis using erank (effective rank) and attention entropy, this work reveals the complementary nature of attention-based and diversity-based visual token pruning methods — attention methods suppress hallucinations but suffer from limited coverage, while diversity methods achieve broad coverage but tend to introduce hallucinations. Based on these findings, AgilePruner is proposed to adaptively switch pruning strategies according to image complexity, achieving robust performance across 9 benchmarks.
AMiD: Knowledge Distillation for LLMs with α-mixture Assistant Distribution: This paper proposes the α-mixture assistant distribution and a unified distillation framework, AMiD. By introducing a new design variable α that controls the geometric shape of the interpolation path between teacher and student distributions, AMiD generalizes existing assistant distribution methods (m-mixture and e-mixture are special cases at α=±1), proves optimality guarantees under arbitrary divergences and α values, and achieves state-of-the-art performance on multiple LLM distillation benchmarks.
AnyBCQ: Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs: This paper proposes AnyBCQ, a multi-precision LLM quantization framework based on Binary-Coded Quantization (BCQ). By progressively expanding precision (freezing existing bit-planes and appending residual bit-planes), a single model supports dynamic switching between 2-bit and 4-bit precision. Dedicated CUDA kernels perform computation directly at the bit-plane level, eliminating lookup-table and transpose overhead. At 2-bit, AnyBCQ substantially outperforms Any-Precision LLM in accuracy (MMLU 35.3% vs. 24.7%) and achieves up to 3.0× throughput over FP16.
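The "freeze existing bit-planes, append residual bit-planes" idea can be sketched with a greedy binary-coded quantizer. This is a simplified illustration under stated assumptions, not AnyBCQ itself: it uses one scale per plane for the whole tensor, no grouping, and no custom kernels, and the function names are made up here.

```python
import numpy as np

def append_bit_plane(weights, planes):
    """Fit one more (scale, signs) pair to the residual of the frozen planes."""
    approx = sum(alpha * b for alpha, b in planes) if planes else 0.0
    residual = weights - approx
    signs = np.where(residual >= 0, 1.0, -1.0)
    alpha = np.abs(residual).mean()     # least-squares scale for these signs
    return planes + [(alpha, signs)]

def reconstruct(planes, bits):
    """Use only the first `bits` planes, e.g. 2 for the low-precision mode."""
    return sum(alpha * b for alpha, b in planes[:bits])

rng = np.random.default_rng(0)
w = rng.standard_normal(1024)

planes = []
for _ in range(4):
    planes = append_bit_plane(w, planes)

# Error at 1, 2, 3 and 4 bits, all read from the same stored planes.
errors = [float(np.linalg.norm(w - reconstruct(planes, b))) for b in range(1, 5)]
```

Because earlier planes are never modified, the 2-bit model is a prefix of the 4-bit model, so switching precision only changes how many planes are read.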
BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models: This paper proposes BeyondBench, an evaluation framework that algorithmically generates mathematical problems on-the-fly (44 tasks / 117 variants / 3 difficulty levels) to ensure each evaluation instance is free from training data contamination. It evaluates 101 language models (0.5B–141B parameters), finding that even the strongest models achieve only 56% accuracy on the Hard Suite, with substantial performance drops when tools are unavailable.
Boomerang Distillation Enables Zero-Shot Model Size Interpolation: This paper proposes the Boomerang Distillation paradigm: train a single small student model, then construct an entire family of intermediate-sized models at zero additional training cost by progressively grafting teacher transformer layer blocks back onto the student. The resulting models interpolate smoothly in performance between the student and teacher, matching or even surpassing independently distilled models of equivalent size.
Boosting Entropy with Bell Box Quantization: This paper proposes Bell Box Quantization (BBQ), the first quantization method that simultaneously satisfies information-theoretic optimality (ITO) and compute-efficiency. The core insight is the domain-agnosticity of learning—the output domain of a quantizer need not coincide with its input domain. BBQ performs ITO quantization in the input domain to maximize entropy, then maps to hardware-acceleratable data types in the output domain, achieving comprehensive improvements over QuEST and LSQ in 1–4 bit QAPT settings.
Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers: Starting from Kolmogorov complexity theory, this paper proposes a theoretical framework of "asymptotically optimal description length objectives," proves the existence of such objectives for Transformers via a novel proof of their computational universality, and empirically validates the framework through a differentiable variational objective based on an adaptive Gaussian mixture prior, revealing significant optimization challenges.
COMI: Coarse-to-fine Context Compression via Marginal Information Gain: This paper proposes COMI, a coarse-to-fine adaptive context compression framework based on Marginal Information Gain (MIG = query relevance − semantic redundancy). At a 32× compression ratio, COMI improves NaturalQuestions EM by approximately 25 points over the second-best method, with the core insight being the joint optimization of relevance and diversity among retained information.
Compute-Optimal Quantization-Aware Training: Through 757 QAT experiments spanning 86M–2.2B parameters and 1–6 bits, this paper demonstrates that the optimal QAT training fraction grows with total compute budget—contradicting the previously held belief that 10% is universally optimal—and proposes the tokens-per-parameter-byte statistic along with a new loss scaling law to accurately predict the optimal QAT allocation strategy and final loss across all configurations.
ConFu: Contemplate the Future for Better Speculative Sampling: ConFu introduces contemplate tokens into the draft model of speculative decoding, enabling it to anticipate the target model's future generation direction. Combined with a MoE dynamic mechanism and anchor-point sampling training, ConFu achieves 8–11% improvements in acceptance rate and generation speed over EAGLE-3.
Cross-Domain Lossy Compression via Rate- and Classification-Constrained Optimal Transport: This paper formalizes cross-domain lossy compression — where the encoder observes a degraded source and the decoder reconstructs samples from a different target distribution — as an optimal transport problem subject to dual constraints on rate and classification loss. Closed-form DRC/RDC and DRPC tradeoff functions are derived for Bernoulli sources (Hamming distortion) and Gaussian sources (MSE). The theoretical predictions are validated against the empirical behavior of deep end-to-end compression models on super-resolution, denoising, and inpainting tasks.
Cut Less, Fold More: Model Compression through the Lens of Projection Geometry: This paper unifies structured pruning and model folding under an orthogonal projection framework—pruning as coordinate-aligned projection and folding as clustering subspace projection—and proves that folding yields strictly smaller parameter reconstruction error under a rank-one difference condition. Validation across 1,000+ checkpoints demonstrates that folding consistently outperforms pruning at medium-to-high compression ratios.
Dataset Color Quantization: A Training-Oriented Framework for Dataset-Level Compression: This paper proposes the Dataset Color Quantization (DCQ) framework, which reduces color redundancy at the dataset level through three mechanisms — chromaticity-aware clustering, attention-guided palette allocation, and texture-preserved palette optimization — achieving storage compression while maintaining training performance.
Dataset Distillation as Pushforward Optimal Quantization: This work reformulates decoupled dataset distillation as an optimal quantization problem, proves that latent-space clustering with learned weights via a diffusion prior can converge to approximate the true data distribution, and proposes the DDOQ algorithm, which surpasses baselines such as D4M on ImageNet-1K with minimal additional computation.
DiffVax: Optimization-Free Image Immunization Against Diffusion-Based Editing: DiffVax trains a feed-forward immunizer (UNet++) that generates imperceptible adversarial perturbations for arbitrary images in a single forward pass (~70ms), causing diffusion-based malicious editing to fail. Compared to prior per-image optimization methods, DiffVax achieves a 250,000× speedup and is the first to extend immunization to video content.
Distillation of Large Language Models via Concrete Score Matching: This paper proposes Concrete Score Distillation (CSD), a knowledge distillation loss for LLMs grounded in discrete score matching. By matching the relative logit differences between all vocabulary token pairs across the student and teacher, CSD simultaneously overcomes the softmax-smoothing problem and the solution-space restriction inherent in direct logit distillation.
Distilling and Adapting: A Topology-Aware Framework for Zero-Shot Interaction Prediction in Multiplex Biological Networks: This paper proposes CAZI-MBN, a framework that integrates domain-specific LLM sequence embeddings, a topology-aware unified graph tokenizer, context-aware cross-layer attention, and teacher-student distillation to enable zero-shot interaction prediction for unseen entities in multiplex biological networks, achieving AUROC improvements of 3.1–20.4% over the best baseline across 5 benchmark datasets.
Draft-based Approximate Inference for LLMs: This paper proposes the Draft-based Approximate Inference framework, which leverages lookahead predictions from a lightweight draft model to more accurately estimate token/KV pair importance. The framework comprises three methods — SpecKV (KV cache dropping), SpecPC (prompt compression), and SpecKV-PC (cascaded compression) — and consistently outperforms existing baselines on long-context benchmarks.
Efficient Reasoning with Balanced Thinking: This paper proposes ReBalance, a training-free framework that simultaneously mitigates overthinking and underthinking in large reasoning models (LRMs) via confidence-guided dynamic hidden-state steering vectors, achieving joint improvements in both reasoning efficiency and accuracy.
Embedding Compression via Spherical Coordinates: This paper proposes an embedding compression method based on spherical coordinate transformation. By exploiting the mathematical property that angular coordinates of high-dimensional unit vectors concentrate near $\pi/2$, the method substantially reduces the entropy of the exponent bits and high-order mantissa bits in IEEE 754 floating-point representations, achieving a 1.5× compression ratio — a 25% improvement over the best lossless methods — with reconstruction error below float32 machine precision.
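The concentration property is easy to observe. The sketch below converts a random unit vector to hyperspherical angles and back, and measures how far the leading angles sit from $\pi/2$. It only illustrates the coordinate change; the bit-level entropy coding described in the entry is not reproduced, and the function names and the 512-dimensional test vector are assumptions.

```python
import numpy as np

def to_angles(x):
    """Unit vector of length n -> n - 1 hyperspherical angles."""
    tail = np.sqrt(np.cumsum(x[::-1] ** 2))[::-1]    # tail[i] = ||x[i:]||
    angles = np.arccos(np.clip(x[:-2] / tail[:-2], -1.0, 1.0))
    last = np.arctan2(x[-1], x[-2])                   # keeps the final sign
    return np.append(angles, last)

def from_angles(angles):
    """Inverse of to_angles."""
    sin_prod = np.concatenate(([1.0], np.cumprod(np.sin(angles))))
    x = sin_prod.copy()
    x[:-1] *= np.cos(angles)
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal(512)
x /= np.linalg.norm(x)

angles = to_angles(x)
restored = from_angles(angles)
mean_dev = float(np.abs(angles[:256] - np.pi / 2).mean())
```

A small `mean_dev` means the leading angles share most of their exponent and high-order mantissa bits, which is the redundancy such a scheme could exploit.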
ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping: To address the extensive computational redundancy in diffusion large language model (dLLM) inference, this paper proposes ES-dLLM, a training-free Early-Skipping acceleration framework. By estimating token importance and skipping low-importance positions in early layers, ES-dLLM achieves 5.6×–16.8× speedup on LLaDA-8B and Dream-7B without degrading generation quality.
Evolution and compression in LLMs: On the emergence of human-aligned categorization: Through the Information Bottleneck (IB) framework and the Iterated In-Context Language Learning (IICLL) paradigm, this paper demonstrates that LLMs can spontaneously develop category structures that are highly aligned with human semantic categorization systems and achieve near-optimal compression efficiency, without having been trained on any IB objective.
FASA: Frequency-aware Sparse Attention: This paper identifies a functional sparsity in RoPE attention at the frequency chunk (FC) level—fewer than 1% of "dominant FCs" suffice to approximate the token selection behavior of full attention heads. Building on this finding, the authors propose FASA, a training-free framework that employs a two-stage strategy (dominant FCs predict token importance → full attention is computed only over important tokens), achieving 8× memory compression and 2.6× inference speedup with negligible quality loss.
Fine-tuning Quantized Neural Networks with Zeroth-order Optimization: This paper proposes QZO, a method that estimates gradients via zeroth-order perturbations applied to quantization scaling factors (rather than discrete weights), and stabilizes training with directional derivative clipping (DDC). QZO enables memory-efficient fine-tuning of 4-bit/2-bit LLMs with over 18× total memory reduction.
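A minimal sketch of the general idea, assuming a two-point zeroth-order estimator and reading "directional derivative clipping" as clipping the scalar finite-difference value. It is not the QZO implementation: the toy layer, the loss, the step size, and all names are hypothetical.

```python
import numpy as np

def zo_scale_gradient(loss_fn, scales, rng, eps=1e-3, clip=None):
    """Two-point zeroth-order estimate of d loss / d scales."""
    z = rng.standard_normal(scales.shape)
    directional = (loss_fn(scales + eps * z) - loss_fn(scales - eps * z)) / (2 * eps)
    if clip is not None:
        # Assumed form of clipping: bound the scalar directional derivative.
        directional = float(np.clip(directional, -clip, clip))
    return directional * z

# Toy 4-bit layer: integer codes stay fixed, only per-row scales are tuned.
rng = np.random.default_rng(0)
codes = rng.integers(-8, 8, size=(4, 16)).astype(float)
target = rng.standard_normal((4, 16))

def loss_fn(scales):
    return float(np.mean((scales[:, None] * codes - target) ** 2))

scales = np.full(4, 0.1)
initial_loss = loss_fn(scales)
for _ in range(500):
    scales -= 0.01 * zo_scale_gradient(loss_fn, scales, rng, clip=10.0)
final_loss = loss_fn(scales)
```

Only forward evaluations of `loss_fn` are needed, so no gradients or optimizer states are stored for the discrete codes.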
FlyPrompt: Brain-Inspired Random-Expanded Routing with Temporal-Ensemble Experts for General Continual Learning: Inspired by the mushroom body circuitry of Drosophila, specifically its sparse random expansion and modular integration mechanisms, FlyPrompt decomposes General Continual Learning (GCL) into two sub-problems: expert routing and expert capacity. It addresses routing with a Random-Expanded Analytic Router (REAR) for non-iterative expert selection, and capacity with a multi-timescale EMA output head, Temporal-Ensemble Experts (TE²), achieving gains of up to 11.23%/12.43%/7.62% on CIFAR-100/ImageNet-R/CUB-200.
FreqKV: Key-Value Compression in Frequency Domain for Context Window Extension: This paper proposes FreqKV, a parameter-free and architecture-agnostic KV cache compression method that iteratively compresses KV states in the frequency domain by retaining low-frequency components and discarding high-frequency ones. With only lightweight fine-tuning on 8K-length sequences, FreqKV extends the context window of LLaMA-2-7B to 256K while maintaining stable perplexity.
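Retaining low-frequency components along the sequence axis can be sketched with a discrete cosine transform. The code below is an illustration only, not FreqKV's implementation: the choice of an orthonormal DCT-II, the amplitude rescaling, and the function names are assumptions.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size (n, n)."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    mat[0] /= np.sqrt(2.0)
    return mat

def compress_kv(states, keep):
    """Shrink (seq_len, dim) states to (keep, dim) via low frequencies."""
    seq_len = states.shape[0]
    coeffs = dct_matrix(seq_len) @ states             # to frequency domain
    low = coeffs[:keep]                               # drop high frequencies
    # Rescale so amplitudes survive the change of length (an assumption).
    return np.sqrt(keep / seq_len) * (dct_matrix(keep).T @ low)

# A constant signal has only a zero-frequency component, so it should
# come back unchanged at the shorter length.
states = np.full((32, 4), 3.0)
compressed = compress_kv(states, 8)
```

Applying `compress_kv` again whenever the cache fills up would give an iterative scheme in the spirit of the entry, with older content passing through more rounds of low-pass filtering.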
Grounding and Enhancing Informativeness and Utility in Dataset Distillation: This paper proposes InfoUtil, a framework that maximizes sample informativeness via game-theoretic Shapley Values (to identify the most critical patches) and maximizes sample utility via gradient norms (to select the most training-valuable samples), achieving a 6.1% improvement over the previous SOTA on ImageNet-1K.
HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design: This paper proposes the HiFo-Prompt framework, which enhances LLM-driven Automatic Heuristic Design (AHD) through two collaborative modules—Hindsight (a retrospective insight pool) and Foresight (a prospective evolutionary navigator)—achieving substantial improvements over existing methods on tasks such as TSP and FSSP.
Highly Efficient and Effective LLMs with Multi-Boolean Architectures: This paper proposes a novel framework that represents LLM weights as multi-kernel Boolean parameters, enabling, for the first time, direct finetuning of large language models entirely within the Boolean domain—without requiring full-precision latent weights. The approach simultaneously surpasses existing ultra-low-bit quantization and binarization methods in both representational capacity and computational efficiency.
IDER: IDempotent Experience Replay for Reliable Continual Learning: This paper introduces idempotence into continual learning, enforcing output self-consistency during new task acquisition via two components—a Standard Idempotent Module and an Idempotent Distillation Module—simultaneously improving prediction reliability (reduced calibration error) and significantly mitigating catastrophic forgetting.
Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning: This paper proposes TIR-Judge, an end-to-end RL framework that trains LLM judge models to interleave reasoning and code execution during evaluation. With only 8B parameters, TIR-Judge surpasses 32B reasoning reward models across 7 public benchmarks; its distillation-free variant, TIR-Judge-Zero, achieves further self-bootstrapped improvement.
InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models: This paper proposes InftyThink, a paradigm that transforms monolithic long-form reasoning into iterative short-form reasoning with intermediate summarization. Without modifying the model architecture, it achieves theoretically unbounded reasoning depth and significantly reduced computational cost, yielding an 11% improvement for Qwen2.5-Math-7B on AIME24.
Is Finer Better? The Limits of Microscaling Formats in Large Language Models: This paper identifies and explains a counterintuitive anomaly in microscaling quantization — namely that reducing block size below a certain threshold increases quantization error for narrow-distribution tensors due to the limited dynamic range of the FP8 UE4M3 scale format — and proposes FP8 UE5M3 as a hardware-friendly solution.
KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models: This paper proposes KBVQ-MoE, the first vector quantization framework specifically designed for MoE architectures. It eliminates inter-expert redundancy sharing (IDRE) via KLT-guided SVD and stabilizes outputs through bias-corrected output stabilization (BCOS), achieving 10%+ accuracy improvement over existing methods at 2-bit quantization.
Knowledge Fusion of Large Language Models Via Modular Skillpacks: This paper proposes GraftLLM, a framework that extracts capabilities from heterogeneous source models into compact, transferable "SkillPacks" (modular skill packages). Through a module-aware adaptive compression strategy that stores parameter deltas, GraftLLM supports knowledge transfer, heterogeneous model fusion, and continual learning without forgetting, significantly outperforming existing PEFT and parameter merging methods across multiple settings.
Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models: This paper proposes Landscape of Thoughts (LoT), the first tool to visualize LLM reasoning trajectories as two-dimensional terrain maps. By encoding intermediate states via perplexity-based features and projecting them with t-SNE, LoT reveals reasoning behavior patterns and can be adapted as a lightweight verifier to improve reasoning accuracy and test-time scaling.
LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts: This paper proposes LD-MoLE, which replaces conventional TopK routing with a Sparsegen closed-form projection to achieve differentiable, dynamic, token-adaptive LoRA expert assignment. A lightweight MLP predicts sparse factors, and an analytic sparsity loss is employed. LD-MoLE outperforms fixed-routing and ReLU-routing baselines across multiple benchmarks.
LightMem: Lightweight and Efficient Memory-Augmented Generation: This paper proposes LightMem, a three-stage lightweight memory system inspired by the human Atkinson–Shiffrin memory model. Through three modules — cognitive sensory memory pre-compression, topic-aware short-term memory consolidation, and offline sleep-time updating — LightMem achieves up to 7.7% accuracy improvement on LongMemEval while reducing token consumption by up to 38×.
LLM DNA: Tracing Model Evolution via Functional Representations: Drawing an analogy from biological DNA, this work formally defines LLM DNA as a low-dimensional bi-Lipschitz representation of a model's functional behavior, proves that it satisfies the properties of heritability and genetic determinism, and designs a training-free RepTrace pipeline to extract DNA from 305 LLMs and construct their evolutionary tree.
LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations: This paper demonstrates that LLMs encode model-specific success probability information in their pre-generation internal activations. Training linear probes to extract this signal enables efficient model routing that matches the accuracy of the strongest model while reducing inference cost by 70% on benchmarks such as MATH.
LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning: This paper proposes LoFT, a low-rank adaptation method composed of six building blocks that aligns the internal optimizer dynamics (momentum and second-order moments) with those of full fine-tuning. In the full-rank limit, LoFT exactly recovers AdamW, and it substantially closes the performance gap between LoRA and full fine-tuning across multiple benchmarks.
LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation: This paper proposes LookaheadKV, which predicts true response attention importance scores via learnable lookahead tokens and selectively activated LoRA modules, achieving fast and accurate KV cache eviction without draft generation. The method outperforms existing approaches on multiple long-context benchmarks and reduces eviction overhead by up to 14.5×.
Memba: Membrane-driven Parameter-Efficient Fine-Tuning for Mamba: This paper proposes Memba, a parameter-efficient fine-tuning method inspired by biological neuron membrane potentials. By introducing Leaky Integration Membrane (LIM) neurons into the gating branch of Mamba, Memba achieves temporal adaptability, combined with LoRA placement optimization and cross-layer membrane transfer. With minimal trainable parameters, Memba surpasses existing Mamba PEFT methods on both language and vision tasks.
MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes: Through careful data selection and an adaptive mixing strategy, MobileLLM-R1-950M is pretrained on only 4.2T tokens (11.7% of Qwen3's token budget) and matches or surpasses Qwen3-0.6B on reasoning benchmarks such as AIME, while fully open-sourcing both data sources and training recipes.
Modality-free Graph In-context Alignment: This paper proposes MF-GIA, the first graph in-context learning framework that simultaneously satisfies three conditions: no post-training, cross-domain alignment, and modality-agnosticism. By capturing domain characteristics via gradient fingerprints and aligning features and labels through FiLM-conditioned transformations, MF-GIA achieves state-of-the-art performance on few-shot tasks across multiple graph domains.
MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE: This paper proposes MoNE (Mixture-of-Novices-and-Experts), which identifies redundant experts via joint evaluation of access frequency and output variance, and replaces them with their output mean vectors ("novices"). MoNE achieves more effective and robust compression than existing pruning methods across 5 MoE models, with an average accuracy drop of only 0.14 at a 25% pruning ratio.
Multi-View Encoders for Performance Prediction in LLM-Based Agentic Workflows: This paper proposes Agentic Predictor, a multi-view workflow encoding framework that jointly models graph structure, code semantics, and prompt information to predict the performance of LLM-based agentic workflows, substantially reducing costly trial-and-error evaluations.
Null-Space Filtering for Data-Free Continual Model Merging: Preserving Stability, Promoting Plasticity: This paper proposes NUFILT, a framework that exploits the geometric property of approximate alignment between task vectors and representation subspaces. By applying null-space filtering to suppress interference with previous tasks and projection-aware LoRA to restore plasticity for new tasks, NUFILT achieves continual model merging without accessing any data. It outperforms OPCM by 4–8% on vision, NLP, and multimodal benchmarks, approaching the upper bound of individual fine-tuning.
Parallel Token Prediction for Language Models: This paper proposes Parallel Token Prediction (PTP), which relocates sampling randomness from post-processing to model inputs via auxiliary variables, rendering future tokens deterministic functions of those variables and enabling joint prediction of multiple tokens in a single forward pass.
ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference: This paper proposes ParoQuant, which eliminates weight outliers via hardware-efficient, optimizable independent Givens rotations combined with channel scaling, achieving high-accuracy, low-overhead 4-bit weight quantization for reasoning LLMs.
PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery: This paper proposes PASER, a post-training data selection method for recovering pruned LLMs. It identifies capability-relevant instruction subsets via manifold learning and spectral clustering, and adaptively allocates data budgets according to the degree of capability degradation. Using only 4%–20% of the original data, PASER significantly outperforms full-data recovery.
Pedagogically-Inspired Data Synthesis for Language Model Knowledge Distillation: This paper proposes the IOA (Identifier-Organizer-Adapter) framework, which draws on Bloom's mastery learning principles and Vygotsky's Zone of Proximal Development (ZPD) theory to achieve pedagogically-driven LLM knowledge distillation through three stages: diagnosing knowledge deficiencies, designing progressive curricula, and adapting to cognitive capacity.
π-Flow: Policy-Based Few-Step Generation via Imitation Distillation: This paper proposes π-Flow, which modifies the output layer of a student flow model to predict a policy that generates dynamic flow velocities through multiple sub-steps within a single network evaluation, enabling precise ODE integration. Combined with imitation distillation—matching teacher velocities along the student's own trajectories—the method achieves stable and scalable few-step generation without the quality–diversity trade-off.
PTQ4ARVG: Post-Training Quantization for AutoRegressive Visual Generation Models: PTQ4ARVG is proposed as the first systematic PTQ framework for autoregressive visual generation (ARVG) models. It addresses three ARVG-specific quantization challenges via Gain-Projected Scaling (GPS), Static Token-Wise Quantization (STWQ), and Distribution-Guided Calibration (DGC).
QKV Projections Require a Fraction of Their Memory: This paper proposes PAMM (Point-Approximate Matrix Multiplication), an activation compression technique that approximates QKV projection layer activations by randomly selecting a small number of representative tokens, achieving up to 512× compression without degrading model performance.
Rectified Decoupled Dataset Distillation: A Closer Look for Fair and Comprehensive Evaluation: This paper proposes RD3 (Rectified Decoupled Dataset Distillation), systematically demonstrating that performance discrepancies among existing decoupled dataset distillation methods stem primarily from inconsistent post-evaluation settings rather than differences in distillation quality. By establishing a unified and fair evaluation framework, the reported 27.3% performance gap is corrected to 6.7%.
Reference-Guided Machine Unlearning: This paper proposes ReGUn (Reference-Guided Unlearning), which leverages an independent held-out dataset as a reference standard for "unseen behavior." Through class-conditional distillation, the model's behavior on forget data is aligned to that on truly unseen data, achieving a superior forgetting–utility trade-off.
Rethinking Continual Learning with Progressive Neural Collapse: This paper proposes the ProNC framework, which replaces fixed pre-defined ETFs with a progressively expanding Equiangular Tight Frame (ETF) target to achieve a balance between maximal inter-class separation and minimal forgetting in continual learning.
Revisiting Weight Regularization for Low-Rank Continual Learning: This paper reintroduces Elastic Weight Consolidation (EWC) into low-rank continual learning by estimating the Fisher Information Matrix in the full-dimensional space to regularize a shared LoRA module, achieving effective forgetting mitigation under constant memory overhead.
S2R-HDR: A Large-Scale Rendered Dataset for HDR Fusion: This paper proposes S2R-HDR, the first large-scale high-quality synthetic HDR fusion dataset (24,000 samples), and introduces S2R-Adapter, a domain adaptation method that bridges the synthetic-to-real gap, achieving state-of-the-art HDR fusion performance on real-world datasets.
Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models: This paper systematically uncovers the internal mechanism underlying LLM failures in reasoning hop generalization — namely, attention head competition between correct and erroneous reasoning trajectories — and proposes TCR (Test-time Correction of Reasoning), which dynamically identifies and deactivates erroneous processing heads (ep heads) at inference time to correct reasoning errors, achieving an average accuracy improvement of 5–7%.
SeeDNorm: Self-Rescaled Dynamic Normalization: This paper proposes SeeDNorm, an adaptive dynamic normalization layer that conditions the scaling coefficients on the input itself, thereby preserving input norm information in the forward pass while retaining RMSNorm-like adaptive gradient adjustment in the backward pass. With negligible additional parameters, SeeDNorm consistently outperforms RMSNorm, LayerNorm, and DyT on both language modeling and vision tasks.
SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models: This paper proposes SERE, which pre-computes an expert similarity matrix and dynamically re-routes secondary experts to their most similar primary experts during batch decoding, achieving up to 2.0× speedup with negligible quality loss. It is accompanied by a plug-and-play vLLM CUDA kernel.
SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs: This paper systematically revisits the impact of domain-specific SFT on the general capabilities of LLMs, demonstrating that using a smaller learning rate can substantially mitigate general capability degradation, and proposes Token-Adaptive Loss Reweighting (TALR), which further optimizes the trade-off between domain adaptation and general capability retention by adaptively down-weighting the loss of low-probability tokens.
Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models: Grounded in the Linear Representation Hypothesis (LRH), this paper proposes a theoretical framework termed specialization after generalization, providing the first systematic explanation of why TTT is effective under in-distribution settings. Foundation models suffer from concept superposition due to global underparameterization; TTT temporarily forgets irrelevant concepts to free model capacity, locally specializing to the small set of concepts relevant to the test task. The theory guarantees generalization even when the feature space is exponentially smaller than the concept space.
STAR: Similarity-guided Teacher-Assisted Refinement for Super-Tiny Function Calling Models: This paper proposes the STAR framework, which combines Constrained Knowledge Distillation (CKD) and Similarity-guided Reinforcement Learning (Sim-RL) to effectively transfer function calling capabilities from large models to super-tiny models at the 0.6B scale, achieving substantial improvements over baselines on BFCL and ACEBench.
Steering MoE LLMs via Expert (De)Activation: This paper proposes SteerMoE, which detects behavior-correlated experts via contrastive paired inputs and steers MoE LLM behavior at inference time by activating or deactivating specific experts (safety +20%, faithfulness +27%), while also exposing the fragility of safety alignment in MoE models (safety collapse −100%).
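Inference-time expert (de)activation of the kind the SteerMoE note describes can be sketched as an edit to router logits before top-k selection. This is an illustrative assumption about the mechanism, with a hypothetical `steer_topk` helper; how the paper scores and selects behavior-correlated experts is not shown.

```python
import numpy as np

def steer_topk(router_logits, k, deactivate=(), activate=()):
    """Return top-k expert ids per token after forcing experts off or on."""
    logits = np.array(router_logits, dtype=float)
    off = np.asarray(list(deactivate), dtype=int)
    on = np.asarray(list(activate), dtype=int)
    logits[..., off] = -np.inf   # a deactivated expert can never be selected
    logits[..., on] = np.inf     # an activated expert is always selected
    return np.argsort(-logits, axis=-1)[..., :k]
```

No weights are modified in this sketch; only the routing decision changes, which matches the note's framing of steering at inference time.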
Stress-Testing Alignment Audits with Prompt-Level Strategic Deception: This paper constructs an automated prompt-level red-teaming pipeline (powered by Claude Opus 4.5) to augment situational awareness and strategic reasoning in existing fine-tuned model organisms, and stress-tests four black-box and white-box alignment auditing methods across six experimental settings. The pipeline successfully induces high-confidence incorrect guesses from all auditing methods and provides the first documented instance of prompt-level activation deception without any weight modification.
SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning: This paper proposes SwiReasoning, a training-free LLM reasoning framework that dynamically switches between explicit (chain-of-thought) and implicit (latent space) reasoning modes via entropy-trend-based block-level confidence estimation, achieving Pareto-superior improvements in both accuracy (+1.8%–3.1%) and token efficiency (+57%–79%).
Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation: This work reveals that EMA-based momentum updates are equivalent to gradient descent on an online linear regression objective, and builds upon this insight to propose LoRA-Pre — a method that compresses optimizer momentum via low-rank factorization for memory-efficient LLM pretraining and fine-tuning. LoRA-Pre achieves state-of-the-art performance across all model scales using only 1/8 the rank required by baseline methods.
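The equivalence mentioned in the Taming Momentum note can be checked numerically in its simplest form: an EMA momentum update with decay `beta` equals one gradient step of size `1 - beta` on the squared error between the momentum and the current gradient. The sketch below shows only this elementary identity; the paper's online linear regression objective and the LoRA-Pre factorization may be formulated differently.

```python
import numpy as np

def ema_update(m, g, beta):
    """Standard exponential-moving-average momentum update."""
    return beta * m + (1.0 - beta) * g

def regression_gd_step(m, g, lr):
    """One gradient-descent step on 0.5 * ||m - g||^2 with respect to m."""
    grad = m - g
    return m - lr * grad

rng = np.random.default_rng(0)
m, g = rng.standard_normal(5), rng.standard_normal(5)
beta = 0.9
same = np.allclose(ema_update(m, g, beta), regression_gd_step(m, g, 1.0 - beta))
```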
Textual Equilibrium Propagation for Deep Compound AI Systems: This paper proposes Textual Equilibrium Propagation (TEP), a compound AI system optimization method grounded in local learning principles. Through a two-phase design consisting of a free phase and a nudged phase, TEP avoids gradient explosion/vanishing in global textual backpropagation and significantly outperforms TextGrad on deep workflows.
The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm: This paper provides the first proof that GPTQ (executed in reverse order) is mathematically equivalent to Babai's nearest plane algorithm from classical lattice theory, thereby obtaining a geometric interpretation and layer-wise error upper bounds, upon which a clipping-free improved quantization method is designed.
The Lattice Geometry of Neural Network Quantization -- A Short Equivalence Proof of GPTQ and Babai's Algorithm: Independently of Chen et al. (2026), this paper provides a more concise and elegant proof that GPTQ is equivalent to Babai's nearest plane algorithm, and clarifies the prospect of lattice basis reduction for improving neural network quantization.
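For reference, the classical Babai nearest plane algorithm that both GPTQ equivalence notes refer to can be sketched as follows. This is the textbook lattice procedure on a row basis, not either paper's quantization code; `gram_schmidt` and `babai_nearest_plane` are names chosen here for illustration.

```python
import numpy as np

def gram_schmidt(basis):
    """Unnormalised Gram-Schmidt vectors of the rows of `basis`."""
    ortho = np.zeros_like(basis, dtype=float)
    for i, b in enumerate(basis):
        v = b.astype(float)
        for j in range(i):
            v = v - (b @ ortho[j]) / (ortho[j] @ ortho[j]) * ortho[j]
        ortho[i] = v
    return ortho

def babai_nearest_plane(basis, target):
    """Return integer coefficients and the lattice point close to `target`."""
    basis = np.asarray(basis, dtype=float)
    ortho = gram_schmidt(basis)
    residual = np.asarray(target, dtype=float).copy()
    coeffs = np.zeros(len(basis), dtype=int)
    # Walk the basis from last to first, rounding onto the nearest hyperplane.
    for i in range(len(basis) - 1, -1, -1):
        c = int(np.round((residual @ ortho[i]) / (ortho[i] @ ortho[i])))
        coeffs[i] = c
        residual -= c * basis[i]
    return coeffs, coeffs @ basis
```

With an orthogonal basis the procedure reduces to coordinate-wise rounding, which is the intuition behind reading a sequential rounding scheme as a lattice algorithm.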
The Unseen Frontier: Pushing the Limits of LLM Sparsity with Surrogate-Free ADMM: This paper proposes Elsa, a method that directly solves sparsity-constrained optimization via surrogate-free ADMM, breaking the 50–60% "sparsity wall" bottleneck in LLM pruning and maintaining high model fidelity even at 90% sparsity.
TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA: This paper proposes TiTok, a framework that enables efficient cross-model transfer of LoRA adapters via token-level contrastive excess scores, without requiring an auxiliary discriminator model. TiTok consistently outperforms TransLoRA and knowledge distillation baselines on reasoning and personalization tasks.
Token Distillation: Attention-Aware Input Embeddings for New Tokens: This paper proposes Token Distillation, a method that distills multi-subword interaction information encoded across all Transformer layers into a single token embedding, enabling high-quality initialization of new token embeddings without requiring a pretrained hypernetwork and outperforming existing approaches.
Topology and Geometry of the Learning Space of ReLU Networks: Connectivity and Size: From the perspectives of algebraic geometry and algebraic topology, this paper systematically investigates the connectivity and singularity of the parameter space of feedforward ReLU networks defined on general DAG architectures. It reveals the critical role of bottleneck nodes and balance conditions in determining the topological structure of the parameter space, and establishes a theoretical connection between singularities and differentiable pruning.
Towards Efficient Constraint Handling in Neural Solvers for Routing Problems: This paper proposes the Construct-and-Refine (CaR) framework, which achieves efficient feasibility repair through joint training of a construction module and a lightweight refinement module. CaR provides the first general and efficient neural constraint-handling solution for hard-constrained routing problems, substantially outperforming classical and neural SOTA solvers on TSPTW and CVRPBLTW.
TurboBoA: Faster and Exact Attention-aware Quantization without Backpropagation: TurboBoA is a backpropagation-free post-training quantization method for LLMs that achieves over 3× speedup over BoA while retaining its accuracy advantages, through three innovations: joint multi-output-channel quantization, preceding-layer error compensation, and adaptive grid selection.
Understanding Dataset Distillation via Spectral Filtering: This paper proposes UniDD, a spectral filtering framework that unifies diverse dataset distillation methods as applying different filter functions on the feature-feature correlation (FFC) matrix to match the frequency information of the feature-label correlation (FLC) matrix. Building on this insight, the paper further introduces Curriculum Frequency Matching (CFM).
UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation: This paper proposes UniFlow, a universal unified tokenizer that preserves semantic understanding via hierarchical adaptive self-distillation and achieves high-fidelity reconstruction via a lightweight patch-wise pixel flow decoder. UniFlow achieves state-of-the-art performance on both understanding and generation across 13 benchmarks. The 7B UniFlow-XL surpasses the 14B TokenFlow-XL by 6.05% on average understanding benchmarks while using 40% less training data.
Unveiling Super Experts in Mixture-of-Experts Large Language Models: This paper is the first to discover and systematically study "Super Experts" (SEs) in MoE LLMs—an extremely small subset of experts that are critical to model inference, driving massive activations and attention sink mechanisms through extreme activation outliers in their down_proj outputs.
What Layers When: Learning to Skip Compute in LLMs with Residual Gates: This paper proposes GateSkip, a method that inserts a sigmoid-linear gate at the output of each Attention/MLP branch in a decoder-only Transformer and jointly optimizes gate sparsity and language modeling objectives during fine-tuning. At inference time, it deterministically skips low-importance tokens layer by layer using a quantile threshold over gate values, thereby achieving token-level adaptive depth. On Llama 8B, GateSkip saves 15% compute while retaining >90% accuracy; on instruction-tuned models, the full-compute variant actually improves accuracy over the baseline, and ~50% savings still matches the baseline. The method is orthogonal and composable with INT4 quantization, structured pruning, and self-speculative decoding.
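The quantile-based skipping described in the GateSkip note can be sketched for a single residual branch. Several details are assumptions made for illustration: the gate is a scalar per token, it is computed from the branch input so that skipped tokens avoid the branch entirely, and `gated_branch` and `skip_fraction` are hypothetical names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_branch(h, branch_fn, w_gate, b_gate=0.0, skip_fraction=0.0):
    """Apply a gated residual branch, skipping the lowest-gate tokens.

    h: (tokens, d) residual stream; w_gate: (d,) gate weights (assumed scalar gate).
    Returns the updated stream and a boolean mask of tokens that ran the branch.
    """
    gate = sigmoid(h @ w_gate + b_gate)              # (tokens,)
    threshold = np.quantile(gate, skip_fraction)     # deterministic cut-off
    keep = gate >= threshold
    out = h.copy()
    # Only kept tokens pay for the branch; skipped tokens pass through unchanged.
    out[keep] = h[keep] + gate[keep, None] * branch_fn(h[keep])
    return out, keep
```

Setting `skip_fraction=0.0` keeps every token, which corresponds to a full-compute variant of this sketch.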
Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis: This paper proposes the TAPPA framework, which explains the formation mechanisms of various attention patterns in LLMs (attention sink, diagonal, periodic, etc.) from a temporal continuity perspective in a unified manner, and leverages query self-similarity (q-similarity) as a metric to guide KV cache compression and model pruning tasks.

🏥 Medical Imaging

Adaptive Domain Shift in Diffusion Models for Cross-Modality Image Translation: This paper proposes CDTSDE, a framework that embeds a learnable spatial-adaptive domain mixing field $\Lambda_t$ into the reverse SDE of diffusion models, enabling cross-modality translation paths to traverse low-energy manifolds. The approach achieves higher fidelity with fewer denoising steps on MRI modality conversion, SAR→Optical, and industrial defect semantic mapping tasks.
Adaptive Test-Time Training for Predicting Need for Invasive Mechanical Ventilation in Multi-Center Cohorts: This paper proposes AdaTTT, a framework that achieves robust test-time adaptation on multi-center ICU EHR data for 24-hour-ahead invasive mechanical ventilation (IMV) prediction, via dynamic feature-aware self-supervised learning (adaptive masking strategy) and prototype-guided partial optimal transport alignment.
AFD-INSTRUCTION: A Comprehensive Antibody Instruction Dataset with Functional Annotations for LLM-Based Understanding and Design: This work constructs AFD-Instruction, the first large-scale antibody functional annotation instruction dataset (430K+ entries), aligning antibody sequences with natural-language functional descriptions via a multi-agent literature extraction pipeline. The dataset is used to instruction-tune general-purpose LLMs for antibody understanding and function-guided design, achieving an average accuracy improvement of 20+ points across five classification tasks.
An Orthogonal Learner for Individualized Outcomes in Markov Decision Processes: This paper systematically introduces semiparametric efficiency theory from causal inference into Q-function estimation for MDPs. It demonstrates that classical Q-regression and FQE are essentially naive plug-in learners subject to plug-in bias, and proposes the DRQQ-learner—a meta-learner that simultaneously achieves double robustness, Neyman orthogonality, and near-oracle efficiency. By deriving the efficient influence function (EIF) to construct a debiased two-stage loss, the method comprehensively outperforms baselines in Taxi and Frozen Lake environments.
AntigenLM: Structure-Aware DNA Language Modeling for Influenza: AntigenLM is a GPT-2-style DNA language model that preserves the integrity of genomic functional units. Pretrained on complete influenza virus whole genomes and subsequently fine-tuned, it autoregressively predicts antigenic sequences of future circulating strains, achieving significantly lower amino acid mismatch rates than the evolutionary model beth-1 and general-purpose genomic models.
ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue: This paper proposes ATPO (Adaptive Tree Policy Optimization), which models multi-turn medical dialogue as a hierarchical Markov decision process (H-MDP). ATPO dynamically allocates rollout budgets via an uncertainty-aware adaptive tree expansion mechanism, using a composite uncertainty measure combining Bellman error and action-value variance to guide exploration. With Qwen3-8B, ATPO surpasses GPT-4o on three medical dialogue benchmarks.
Augmenting Representations with Scientific Papers: This paper proposes the first multimodal foundation model framework that aligns X-ray spectra with scientific literature via contrastive learning, achieving 20% Recall@1% cross-modal retrieval in a shared latent space, improving physical parameter estimation by 16–18%, and discovering rare astrophysical objects including candidate pulsating ultraluminous X-ray sources.
Benchmarking ECG FMs: A Reality Check Across Clinical Tasks: A comprehensive "reality check" benchmark evaluating 8 ECG foundation models across 12 datasets and 26 clinical tasks reveals that the compact structured state space model (SSM) ECG-CPC outperforms large-scale Transformers in 5 out of 7 task categories, demonstrating that architectural design matters more than model scale.
BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases: This paper introduces BiomedSQL, the first benchmark specifically designed to evaluate the scientific reasoning capabilities of Text-to-SQL systems on biomedical knowledge bases. It comprises 68,000 question/SQL/answer triples and reveals a substantial gap between the best-performing model (GPT-o3-mini, 62.6%) and domain experts (90%).
Boosting Medical Visual Understanding From Multi-Granular Language Learning: This paper proposes Multi-Granular Language Learning (MGLL), a plug-and-play contrastive learning framework that jointly optimizes a soft CLIP loss, a point-wise loss, and a smooth KL divergence to align medical images with multi-label, multi-granular text descriptions. MGLL consistently surpasses state-of-the-art methods on fundus and X-ray datasets, and when used as a visual encoder for multimodal large language models, improves diagnostic accuracy by up to 34.1%.
Bridging Explainability and Embeddings: BEE Aware of Spuriousness: This paper proposes the BEE framework, which identifies and names spurious correlations (SCs) directly from learned classifier weights by analyzing how fine-tuning perturbs the weight-space geometry of pre-trained representations. The method requires no counterfactual samples and can discover hidden dataset biases. On ImageNet-1k, BEE uncovers spurious associations that reduce accuracy by up to 95%.
Can SAEs Reveal and Mitigate Racial Biases of LLMs in Healthcare?: This paper investigates whether Sparse Autoencoders (SAEs) can reveal and mitigate racial biases in LLMs within clinical settings. SAEs successfully identify harmful race-associated features (e.g., co-activation of "Black" with violence-related terms), but their effectiveness at bias mitigation in complex clinical tasks is limited (FLDD < 3%), falling far short of simple prompting strategies (FLDD 8–15%).
CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework: This paper proposes CARE, a framework that decomposes medical VQA into a three-stage expert pipeline—entity proposal → referring segmentation → evidence-grounded QA—with RLVR fine-tuning applied to each VLM and GPT-5 serving as a dynamic coordinator for tool planning and CoT review. CARE achieves an average accuracy of 77.54% across four medical VQA benchmarks using only 10B parameters, surpassing the 32B end-to-end state-of-the-art (72.29%).
Causal Interpretation of Neural Network Computations with Contribution Decomposition: This paper proposes CODEC (Contribution Decomposition), which applies Integrated Gradients to compute the contribution of hidden-layer neurons to the output (rather than analyzing activations alone), and then decomposes these contributions into sparse modes via a Sparse Autoencoder. This approach achieves stronger causal interpretability and network control than activation-based analysis, and is successfully applied to ResNet-50 and a retinal biological neural network model.
Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space: This paper proposes modeling the human concept production process as cumulative trajectories in Transformer embedding space, defining 5 kinematic metrics (distance, velocity, acceleration, entropy, and centroid distance). Evaluated on 4 datasets spanning 3 languages and covering neurodegenerative disease, taboo word fluency, and attribute listing tasks, the framework successfully distinguishes clinical groups and concept categories, with highly consistent results across different embedding models.
COMPASS: Robust Feature Conformal Prediction for Medical Segmentation Metrics: COMPASS constructs conformal prediction intervals by applying linear perturbations along the low-dimensional subspace most sensitive to the target metric within the intermediate feature space of a segmentation network. It achieves significantly narrower prediction intervals than conventional CP methods across four medical segmentation tasks while maintaining valid coverage.
ConfHit: Conformal Generative Design with Oracle Free Guarantees: This paper proposes ConfHit, a framework that employs density-ratio-weighted conformal permutation p-values to perform certification (determining whether a generated batch contains a hit) and design (pruning the candidate set while preserving statistical guarantees). Without requiring an experimental oracle and under distributional shift, ConfHit provides finite-sample $1-\alpha$ coverage guarantees for generative molecular design.
Controllable Sequence Editing for Biological and Clinical Trajectories: This paper proposes Clef, a controllable sequence editing model based on temporal concepts that performs immediate and delayed editing of biological/clinical multivariate trajectories under given conditions (e.g., drugs, surgery). On cell reprogramming and patient laboratory test data, Clef achieves 16.28% MAE improvement for immediate editing, 26.73% for delayed editing, and up to 62.84% improvement for zero-shot counterfactual generation.
Controlling Repetition in Protein Language Models: This work presents the first systematic study of pathological repetition in protein language models (PLMs), introducing a unified repetition metric $R(x)$ and a utility metric $U(x)$, and proposes UCCS (Utility-Controlled Contrastive Steering), a method that injects steering vectors decoupled from repetition into hidden layers at inference time to suppress repetition while preserving folding credibility without retraining the model.
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of LLMs in Mental Health QA: CounselBench is a two-component benchmark constructed with 100 licensed mental health professionals — CounselBench-EVAL (2,000 expert annotations across six clinical dimensions) and CounselBench-Adv (120 adversarial questions with 1,080 annotated responses) — systematically revealing that LLMs achieve superficially high scores in mental health open-ended QA while exhibiting safety risks such as over-generalization and unsolicited medical advice, and demonstrating that LLM-as-Judge is severely unreliable in safety-critical domains.
CryoNet.Refine: A One-step Diffusion Model for Rapid Refinement of Structural Models with Cryo-EM Density Map Restraints: CryoNet.Refine is the first AI-based framework for cryo-EM atomic model refinement. It integrates a one-step diffusion model initialized from Boltz-2 weights, a novel differentiable density generator that physically simulates synthetic density maps, and the first use of density map correlation (cosine similarity) as a differentiable loss function, jointly optimized with geometric constraint losses including Ramachandran, rotamer, and bond angle terms. A test-time optimization strategy enables per-case customization. On a benchmark of 120 protein and DNA/RNA complexes, the method comprehensively outperforms Phenix.real_space_refine (CC_mask: 0.59 vs. 0.54; Ramachandran favored: 98.92%).
Decentralized Attention Fails Centralized Signals: Rethinking Transformers for Medical Time Series: This paper proposes the TeCh framework, whose core contribution is the CoTAR (Core Token Aggregation-Redistribution) module, which replaces standard attention in Transformers to model channel dependencies in medical time series. By introducing a global "core token" as a proxy — first aggregating information from all channels and then redistributing it back — the computational complexity is reduced from $O(n^2)$ to $O(n)$. On the APAVA dataset, TeCh achieves 86.86% accuracy (surpassing Medformer by 12.13%) while consuming only 33% of the memory and 20% of the inference time.
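The aggregate-then-redistribute pattern in the TeCh note can be sketched to show why the cost is linear in the number of channels. Everything below is an illustrative assumption rather than the paper's CoTAR module: the softmax pooling, the concatenation-based redistribution, and the names `core_token_mix`, `w_score`, and `w_out`.

```python
import numpy as np

def core_token_mix(x, w_score, w_out):
    """Mix information across channels through a single core token.

    x: (n_channels, d) channel embeddings
    w_score: (d,) scoring vector used for pooling (hypothetical)
    w_out: (2 * d, d) projection applied after redistribution (hypothetical)
    """
    scores = x @ w_score                             # one score per channel, O(n)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    core = weights @ x                               # aggregate: (d,) core token
    # Redistribute: every channel sees its own embedding plus the shared core.
    fused = np.concatenate([x, np.broadcast_to(core, x.shape)], axis=1)
    return fused @ w_out                             # (n_channels, d)
```

Each channel interacts only with the core token, so no pairwise channel-to-channel score matrix is ever formed.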
Deep Hierarchical Learning with Nested Subspace Networks for Large Language Models: This paper proposes Nested Subspace Networks (NSN), which reparameterize linear layers via low-rank decomposition into a strictly nested subspace hierarchy. Combined with uncertainty-aware multi-rank training, a single model can instantaneously trade off computation against performance at test time (50% FLOPs reduction with only 5% accuracy loss), and can be applied post-hoc to pretrained LLMs.
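The nested low-rank reparameterization in the NSN note can be sketched with a layer whose rank-r weight uses only the first r factor columns, so that each smaller model is contained in the larger one. The class name, initialization, and slicing scheme are assumptions for illustration; the uncertainty-aware multi-rank training is not shown.

```python
import numpy as np

class NestedLowRankLinear:
    """Linear layer W = A @ B whose rank can be chosen at call time."""

    def __init__(self, d_in, d_out, max_rank, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((d_in, max_rank)) / np.sqrt(d_in)
        self.B = rng.standard_normal((max_rank, d_out)) / np.sqrt(max_rank)

    def forward(self, x, rank):
        # Using the leading `rank` components makes the subspaces strictly nested.
        return (x @ self.A[:, :rank]) @ self.B[:rank, :]
```

Lowering `rank` at test time reduces the multiply-accumulate count of both matrix products without loading a different set of weights.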
DISCO: Densely-overlapping Cell Instance Segmentation via Adjacency-aware Collaborative Coloring: This paper formulates densely-overlapping cell instance segmentation as a graph coloring problem and proposes Disco, a divide-and-conquer framework combining explicit conflict node marking with implicit adjacency-constrained disambiguation. By decomposing cell adjacency graphs via BFS and introducing five collaborative loss functions, Disco achieves a 7.08% PQ improvement on the high-density pathology dataset GBC-FS 2025 while attaining state-of-the-art performance across four heterogeneous datasets.
Discrete Diffusion Trajectory Alignment via Stepwise Decomposition: This paper proposes SDPO (Stepwise Decomposition Preference Optimization), which decomposes the trajectory alignment problem of discrete diffusion models into stepwise posterior alignment subproblems, avoiding the difficulty of backpropagating gradients through the entire denoising chain. SDPO achieves significant improvements over existing methods across three tasks: DNA sequence design, protein inverse folding, and language modeling.
DistMLIP: A Distributed Inference Platform for Machine Learning Interatomic Potentials: DistMLIP is a distributed inference platform based on a zero-redundancy graph-level parallelization strategy that addresses the lack of multi-GPU support in existing machine learning interatomic potentials (MLIPs). On 8 GPUs, it enables simulations approaching one million atoms, achieving up to 8× speedup over spatial partitioning methods while supporting systems 3.4× larger.
Distributional Consistency Loss: Beyond Pointwise Data Terms in Inverse Problems: This paper proposes the Distributional Consistency (DC) loss, which replaces conventional pointwise data fidelity terms (e.g., MSE/NLL) with distribution-level calibration, thereby eliminating overfitting to noise. The approach achieves significant performance gains in DIP-based denoising and PET image reconstruction without requiring early stopping.
DM4CT: Benchmarking Diffusion Models for Computed Tomography Reconstruction: DM4CT is the first systematic benchmark for diffusion-based CT reconstruction. It covers ten diffusion methods and seven baselines, evaluated comprehensively across medical, industrial, and synchrotron radiation datasets, and reveals both the strengths and limitations of diffusion models in CT reconstruction.
DriftLite: Lightweight Drift Control for Inference-Time Scaling of Diffusion Models: DriftLite exploits the inherent degrees of freedom between the drift and potential function in the Fokker-Planck equation to actively stabilize particle weights by solving a lightweight linear system for the optimal control drift at each step. This approach addresses weight degeneracy in Sequential Monte Carlo (SMC) at minimal computational cost, substantially outperforming Guidance-SMC baselines on Gaussian mixture, molecular system, and protein–ligand co-folding tasks.
Dual Distillation for Few-Shot Anomaly Detection: This paper proposes D²4FAD, a dual distillation framework that combines teacher-student distillation on query images (TSD) and student self-distillation on support images (SSD), augmented by a learning-to-weight mechanism (L2W) for adaptive support importance estimation. The method achieves 100% AUROC on the APTOS fundus dataset with only 2-shot support.
EMR-AGENT: Automating Cohort and Feature Extraction from EMR Databases: This paper proposes EMR-AGENT, the first LLM agent-based framework for automated EMR preprocessing. By replacing hand-crafted rules with dynamic SQL interaction, it achieves cross-database cohort selection, feature extraction, and code mapping, demonstrating strong performance and generalization on MIMIC-III, eICU, and SICdb.
EvoFlows: Evolutionary Edit-Based Flow-Matching for Protein Engineering: EvoFlows is an edit-based flow matching approach that learns mutational trajectories between evolutionarily related protein sequences, enabling a controllable number of edits (insertions, deletions, substitutions) on a template sequence while jointly predicting what to mutate and where to mutate.
Exo-Plore: Exploring Exoskeleton Control Space through Human-Aligned Simulation: This paper proposes the Exo-Plore framework, which combines neuromechanical simulation with deep reinforcement learning to optimize hip exoskeleton control parameters without requiring human subject experiments, and generalizes to pathological gait scenarios.
ExpGuard: LLM Content Moderation in Specialized Domains: This paper proposes ExpGuard, a safety guardrail model targeting specialized domains such as finance, healthcare, and law, along with a companion dataset ExpGuardMix (58,928 samples). ExpGuard achieves prompt classification F1 exceeding WildGuard by 8.9% and response classification by 15.3% on domain-specific test sets, while maintaining state-of-the-art performance on general safety benchmarks.
Exploiting Low-Dimensional Manifold of Features for Few-Shot Whole Slide Image Classification: This paper identifies that pathology foundation model (PFM) features reside on a low-dimensional manifold (effective rank only 29.7 out of 512 dimensions), and that standard linear layers destroy this geometric structure, causing few-shot overfitting. The authors propose a plug-and-play MR Block — combining a frozen random matrix as a geometric anchor with a low-rank residual path for task adaptation — achieving state-of-the-art performance on few-shot WSI classification.
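The structure described for the MR Block, a frozen random matrix plus a low-rank residual path, can be sketched as below. The class name, scaling, and zero initialization of the second low-rank factor are assumptions borrowed from common low-rank adapter practice, not details confirmed by the note.

```python
import numpy as np

class RandomAnchorLowRank:
    """Frozen random projection with a trainable low-rank residual (illustrative)."""

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.R = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)  # frozen anchor
        self.A = rng.standard_normal((d_in, rank)) / np.sqrt(d_in)   # trainable
        self.B = np.zeros((rank, d_out))                             # trainable, zero init

    def forward(self, x):
        # The anchor path never changes; task adaptation lives in A @ B only.
        return x @ self.R + (x @ self.A) @ self.B
```

In this sketch the number of trainable parameters is `rank * (d_in + d_out)`, far fewer than a dense `d_in * d_out` layer when the rank is small.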
Fine-Tuning Diffusion Models via Intermediate Distribution Shaping: This work unifies rejection-sampling-based fine-tuning methods under the GRAFT framework, proving that they implicitly perform KL-regularized reward maximization. Building on this, P-GRAFT is proposed to perform distribution shaping at intermediate denoising steps (achieving a better bias–variance trade-off), and Inverse Noise Correction is introduced to improve flow model quality without reward signals, yielding an 8.81% VQAScore improvement on text-to-image generation.
From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents: This paper proposes EHR-ChatQA, the first benchmark to evaluate the end-to-end interactive workflow of database agents in electronic health record (EHR) settings, covering ambiguity clarification, terminology mismatch resolution, SQL generation, and answer return. Evaluation reveals that the strongest model (o4-mini) achieves Pass@5 above 90% but suffers a substantial drop in Pass^5 (all-success rate), with a gap of up to 60%, exposing critical robustness deficiencies in safety-sensitive clinical domains.
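The gap between Pass@5 and the all-success rate in the EHR-ChatQA note is easier to see with a small computation. The sketch assumes Pass@5 counts a task as solved if at least one of five attempts succeeds, while the all-success rate requires all five to succeed; the outcome table is made-up illustrative data, not results from the paper.

```python
def any_success_rate(trials):
    """Fraction of tasks where at least one attempt succeeded."""
    return sum(any(t) for t in trials) / len(trials)

def all_success_rate(trials):
    """Fraction of tasks where every attempt succeeded."""
    return sum(all(t) for t in trials) / len(trials)

# Hypothetical outcomes: one row per task, five attempts each.
trials = [
    [True, True, True, True, True],
    [True, False, True, True, True],
    [False, False, False, False, False],
    [False, True, False, False, False],
]
any_rate = any_success_rate(trials)
all_rate = all_success_rate(trials)
```

A single failed attempt leaves the any-success rate untouched but removes the task from the all-success rate, which is why the second metric is the stricter measure of reliability.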
Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology: This paper proposes the Stamp framework, which leverages spatial transcriptomics gene expression data as a supervisory signal. Through spatially-aware gene encoder pretraining and hierarchical multi-scale contrastive alignment, it enables joint representation learning of pathology images and spatial transcriptomics data, achieving state-of-the-art performance across 4 downstream tasks on 6 datasets.
Glance and Focus Reinforcement for Pan-cancer Screening: This paper proposes GF-Screen, a two-stage framework in which a lightweight Glance model employs reinforcement learning to rapidly localize CT sub-volumes containing lesions, while a Focus model performs fine-grained segmentation exclusively on the selected regions. By transferring GRPO's group-relative comparison paradigm from NLP to visual sub-volume groups, the method achieves RL optimization without a value network for the first time in a purely visual task. On the FLARE25 pan-cancer challenge, GF-Screen outperforms the champion solution by +25.6% DSC while achieving 5.7× faster inference.
HistoPrism: Unlocking Functional Pathway Analysis from Pan-Cancer Histology via Gene Expression Prediction: This paper proposes HistoPrism, an efficient Transformer architecture that injects cancer-type conditioning via cross-attention to predict pan-cancer gene expression from H&E histology images. It further introduces the Gene Pathway Coherence (GPC) evaluation framework based on Hallmark/GO pathways, achieving substantial improvements over STPath at the pathway level—particularly on low-variance, biologically fundamental pathways.
How to Make the Most of Your Masked Language Model for Protein Engineering: This work proposes a temperature-annealed stochastic beam search (SBS) sampling method for masked language models (MLMs), leveraging a wild-type marginal approximation of pseudo-log-likelihood (PLL) for efficient full-sequence evaluation. In vitro experiments on real therapeutic antibody optimization demonstrate that the choice of sampling algorithm is at least as important as model selection; SBS with guidance achieves a 100% success rate.
Human Behavior Atlas: Benchmarking Unified Psychological and Social Behavior Understanding: This work introduces Human Behavior Atlas—the first large-scale multimodal unified benchmark for behavior understanding spanning four dimensions (affective, cognitive, pathological, and social processes) with 101K+ samples—and trains three OmniSapiens-7B model variants to validate its effectiveness in multi-task training and transfer learning.
Improving 2D Diffusion Models for 3D Medical Imaging with Inter-Slice Consistent Stochasticity: This paper proposes Inter-Slice Consistent Stochasticity (ISCS), which generates inter-slice correlated noise via spherical linear interpolation (Slerp) during the re-noising step of diffusion sampling, eliminating at their root cause the inter-slice discontinuity artifacts that arise in 3D medical reconstruction with 2D diffusion priors. ISCS adds no extra computation, hyperparameters, or training overhead, is plug-and-play compatible with any 2D diffusion inverse problem solver, and yields consistent improvements on sparse-view CT, limited-angle CT, and MRI super-resolution.
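Spherical linear interpolation, the operation the ISCS note relies on, can be sketched as follows. The `slerp` function is the standard formula; `slice_noise`, which interpolates between two independent anchor noise maps at the ends of the volume, is only one assumed way to obtain correlated per-slice noise and may differ from the paper's scheme.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical linear interpolation between two arrays of equal shape."""
    a, b = z0.ravel(), z1.ravel()
    cos_omega = np.clip(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    if np.isclose(np.sin(omega), 0.0):
        # (Anti)parallel inputs: fall back to linear interpolation.
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

def slice_noise(num_slices, shape, seed=0):
    """Correlated noise for a stack of slices (hypothetical anchoring scheme)."""
    rng = np.random.default_rng(seed)
    z_first, z_last = rng.standard_normal(shape), rng.standard_normal(shape)
    ts = np.linspace(0.0, 1.0, num_slices)
    return np.stack([slerp(z_first, z_last, t) for t in ts])
```

Unlike plain linear blending, Slerp between two equal-norm vectors keeps the norm of the result constant, which is why it is a common choice for interpolating Gaussian noise.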
Incentives in Federated Learning with Heterogeneous Agents: This paper analyzes incentive problems in heterogeneous federated learning from a game-theoretic perspective, proves the existence of pure-strategy Nash equilibria under heterogeneous data distributions and PAC accuracy objectives, and proposes a linear programming-based approximation algorithm to determine optimal contribution levels.
Inference-Time Dynamic Modality Selection for Incomplete Multimodal Classification: This paper proposes DyMo, an inference-time dynamic modality selection framework that derives a theoretically grounded MTIR reward function (based on a classification-loss-reduction proxy + class prototype distance + intra-class similarity calibration) to iteratively and selectively fuse reliable recovered modalities at inference time, offering the first systematic resolution of the discarding-imputation dilemma: discarding missing modalities loses task-relevant information, while imputation may introduce noise.
Intrinsic Lorentz Neural Network: This paper proposes ILNN, a fully intrinsic hyperbolic neural network in which all operations are performed entirely within the Lorentz model, eliminating the geometric inconsistencies introduced by Euclidean operations in existing methods. ILNN achieves state-of-the-art performance on image classification, genomics, and graph classification tasks.
Knowledgeable Language Models as Black-Box Optimizers for Personalized Medicine: This paper proposes LEON (LLM-based Entropy-guided Optimization with kNowledgeable priors), a mathematically rigorous framework that models personalized treatment design as a constrained conditional black-box optimization problem. Through entropy constraints and an adversarial source critic, LEON guides an LLM to serve as a zero-shot optimizer that proposes personalized treatment plans without any fine-tuning.
Learning Domain-Aware Task Prompt Representations for Multi-Domain All-in-One Image Restoration: This paper proposes DATPRL-IR, the first multi-domain all-in-one image restoration method, which learns domain-aware task prompt representations via a dual prompt pool (task prompt pool + domain prompt pool). Domain priors are distilled from an MLLM and injected into the backbone through adaptive gated fusion, achieving significant improvements over SOTA across 9 tasks spanning natural, medical, and remote sensing domains.
Learning Patient-Specific Disease Dynamics with Latent Flow Matching for Longitudinal Imaging Generation: This paper proposes the Δ-LFM framework, which employs an ArcRank loss to construct patient-specific, temporally aligned trajectories in latent space (directionally consistent and monotonically increasing in magnitude). The framework extends the flow matching time range from $[0,1]$ to $[0,T]$ (actual time intervals) to enable prediction at arbitrary time points. Δ-LFM comprehensively outperforms eight baseline methods across three Alzheimer's longitudinal MRI benchmarks and introduces a progression-specific evaluation metric, Δ-RMAE.
mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules: This paper proposes mCLM (Modular Chemical Language Model), which represents molecules as sequences of synthesizable building blocks, enabling LLMs to generate molecules that simultaneously satisfy pharmacological function and automated synthesis feasibility. mCLM achieves significant improvements in pharmacokinetic and toxicity properties across 430 FDA-approved drugs.
MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science: This work introduces MedAgentGym, the first unified agentic training environment for biomedical data science, comprising 72,413 task instances spanning 12 real-world scenarios and 129 categories, equipped with an executable sandbox and verifiable ground truth. A systematic benchmark evaluation of 29 LLMs reveals a substantial gap between commercial and open-source models. By combining efficient multi-threaded trajectory sampling with offline and online RL, the authors train Med-Copilot, achieving gains of +43.02% and +45.28% from offline and online RL, respectively, and attaining performance competitive with GPT-4o.
MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning: This paper proposes MMedAgent-RL, a multi-agent system that simulates clinical consultation workflows (triage → specialist → attending physician) optimized via reinforcement learning. The core innovation is Curriculum-guided Multi-Agent Reinforcement Learning (C-MARL) with entropy-aware exploration, enabling the attending physician agent to adopt differentiated explore–exploit strategies when faced with correct, conflicting, or erroneous specialist opinions. The system achieves state-of-the-art performance on 5 medical VQA benchmarks spanning both in-domain and out-of-domain settings.
Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare: This paper introduces MENTAT—an evaluation dataset designed and annotated by 9 U.S. psychiatrists, comprising 203 base questions (each with 5 answer choices) expanded via demographic variable substitution, covering 5 clinical practice domains: diagnosis, treatment, triage, monitoring, and documentation. By systematically substituting patient age, race, and gender, the benchmark evaluates decision-making bias across 22 language models, revealing significant and unpredictable accuracy disparities along demographic dimensions.
NeuroCircuitry-Inspired Hierarchical Graph Causal Attention Networks for Explainable Depression Identification: This paper proposes the NH-GCAT framework, which explicitly incorporates neuroscience priors on depression-related neural circuits into a GNN, modeling brain activity at three spatial scales—region, circuit, and network. The method achieves state-of-the-art classification on the REST-meta-MDD dataset (AUC 78.5%, ACC 73.8%) and provides interpretable analyses consistent with established neuroscientific findings.
Omni-iEEG: A Large-Scale, Comprehensive iEEG Dataset and Benchmark for Epilepsy Research: This paper introduces the Omni-iEEG dataset (302 patients, 178 hours of high-resolution intracranial EEG recordings), defines standardized benchmark tasks and evaluation metrics grounded in clinical priors, and demonstrates that end-to-end modeling can match or surpass traditional biomarker-based approaches for epilepsy surgical planning.
Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling: This paper theoretically identifies two fundamental flaws in existing length-penalty approaches—incorrectly penalizing high-entropy exploration tokens and erroneously rewarding redundant tokens—and proposes the DeCS framework. Through decoupled token-level rewards and curriculum batch scheduling, DeCS reduces reasoning tokens by over 50% across 7 benchmarks while maintaining or even improving model performance.
Protein as a Second Language for LLMs: This work treats amino acid sequences as a "second language" for LLMs. By constructing a protein–natural language bilingual dataset and an adaptive context construction mechanism, the proposed framework enables general-purpose LLMs to achieve an average ROUGE-L improvement of 7%—up to 17.2%—on protein question-answering tasks without any training, even surpassing domain-specific fine-tuned models.
Protein Counterfactuals via Diffusion-Guided Latent Optimization: This paper proposes MCCOP, a framework that performs gradient-guided counterfactual optimization in a continuous joint sequence–structure latent space, using a pretrained diffusion model as a manifold prior. With as few as 2–3 mutations, MCCOP generates biologically plausible protein variants that flip predictor outputs, simultaneously enabling model interpretation and protein design hypothesis generation.
Protein Structure Tokenization via Geometric Byte Pair Encoding: This paper proposes GeoBPE — the first tokenizer to extend Byte Pair Encoding (BPE) from discrete text to continuous protein backbone geometry. By alternating between local merging (k-medoids clustering + quantization) and global correction (differentiable inverse kinematics), GeoBPE constructs a hierarchical structural motif vocabulary that achieves >10× compression ratio and >10× data efficiency over VQ-VAE-based protein structure tokenizers (PSTs), ranking first across 24 test sets spanning 12 downstream tasks.
Q-FSRU: Quantum-Augmented Frequency-Spectral Fusion for Medical Visual Question Answering: This paper proposes the Q-FSRU framework, which transforms medical image and text features into the frequency domain via FFT for fusion, and introduces a quantum-inspired retrieval augmentation mechanism (Quantum RAG) to retrieve medical facts from an external knowledge base, achieving 90.0% accuracy on the VQA-RAD dataset.
Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis: This paper proposes Resp-Agent, a closed-loop multi-agent framework that coordinates a controllable respiratory sound generator and a multimodal diagnoser via an active adversarial curriculum planner (Thinker-A2CA). Built upon a 229k-scale benchmark, the system achieves co-design of generation and diagnosis, substantially improving diagnostic performance on long-tail categories.
Reverse Distillation: Consistently Scaling Protein Language Model Representations: To address the anomalous scaling phenomenon in protein language models (PLMs) where larger models do not necessarily yield better performance, this paper proposes a reverse distillation framework. It uses the representations of a smaller model as a base, extracts orthogonal residual information from a larger model via SVD, and constructs Matryoshka nested embeddings—ensuring that larger reverse-distilled models consistently outperform smaller ones. ESM-2 15B, after reverse distillation, becomes for the first time the strongest model in its family.
Scalable Spatio-Temporal SE(3) Diffusion for Long-Horizon Protein Dynamics: This paper proposes STAR-MD, an SE(3)-equivariant causal diffusion Transformer that achieves microsecond-scale protein dynamics trajectory generation via joint spatio-temporal attention and contextual noise perturbation. STAR-MD attains state-of-the-art performance across all metrics on the ATLAS benchmark and stably extrapolates to microsecond timescales unseen during training.
Scaling with Collapse: Efficient and Predictable Training of LLM Families: This paper demonstrates that the training loss curves (TLCs) of LLM families "collapse" onto a single universal curve when optimization hyperparameters are matched to the data budget, and leverages this phenomenon for two practical applications: (1) deviation from collapse as an early diagnostic signal for training pathologies, and (2) the predictability of the collapse curve enabling early stopping for large-scale hyperparameter tuning.
Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People: This paper introduces the Collaborative Battleship task to evaluate the information-seeking capabilities of language models, and proposes three Bayesian inference strategies (Bayes-Q/M/D) to enhance LM questioning, action selection, and decision-making. The approach enables a weak model (Llama-4-Scout) to achieve superhuman performance (82% win rate) at approximately 1% of the cost of GPT-5.
SONIC: Spectral Oriented Neural Invariant Convolutions: SONIC transfers the core idea of state space models to the multi-dimensional frequency domain, defining a set of orientation-selective spectral transfer functions using 6 continuous parameters (amplitude, orientation, damping, oscillation, etc.), and mixing across channels via low-rank matrices $B$ and $C$. This yields a drop-in convolutional replacement operator that inherently possesses a global receptive field and resolution invariance. On 3D medical segmentation, it matches nnU-Net with nearly two orders of magnitude fewer parameters, and is also competitive on ImageNet.
SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis: This paper introduces SurvHTE-Bench, the first comprehensive benchmark for heterogeneous treatment effect (HTE) estimation on right-censored survival data, encompassing 40 synthetic datasets, 10 semi-synthetic datasets, and 2 real-world datasets. It systematically evaluates 53 estimation methods under varying causal assumption violations and censoring levels, finding that no single method dominates, and that survival meta-learners—particularly S-Learner-Survival and Matching-Survival—are most robust under high censoring and assumption violations.
SynCoGen: Synthesizable 3D Molecule Generation via Joint Reaction and Coordinate Modeling: SynCoGen proposes a multimodal generative framework combining masked graph diffusion and flow matching to jointly sample molecular building-block reaction graphs and 3D atomic coordinates, achieving high-quality 3D molecule generation while guaranteeing synthetic feasibility.
Thompson Sampling via Fine-Tuning of LLMs: This paper proposes ToSFiT, which extends Thompson Sampling to large-scale unstructured discrete spaces by fine-tuning large language models to directly parameterize the Probability of Maximality (PoM), thereby circumventing the intractability of acquisition function maximization.
Tracing Pharmacological Knowledge in Large Language Models: This paper presents the first systematic causal analysis of the encoding mechanisms for drug-group semantics in biomedical LLMs, revealing that drug-group knowledge is stored in early layers and distributed across multiple tokens (not the last token alone), and that linearly separable semantic information is already present at the embedding layer.
Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct: This paper proposes DiDi-Instruct, a distillation framework based on Integrated KL Divergence (IKL) minimization that compresses a pretrained diffusion large language model (dLLM) into a few-step student model. Through four key designs—adversarial density ratio estimation, grouped reward normalization, score decomposition, and a Reward-Guided Ancestral Sampler (RGAS)—the student model surpasses the 1024-step teacher's perplexity on OpenWebText using only 16 steps, achieving up to 64× inference speedup at a training cost of just 1 GPU hour.
Unified Biomolecular Trajectory Generation via Pretrained Variational Bridge: PVB (Pretrained Variational Bridge) unifies the training objectives of single-structure pretraining and paired-trajectory fine-tuning via an encoder-decoder architecture combined with augmented bridge matching, enabling cross-domain biomolecular trajectory generation. It further accelerates protein–ligand holo-state exploration through RL fine-tuning based on adjoint matching.

💡 LLM Reasoning¶

Adaptive Social Learning via Mode Policy Optimization for Language Agents: This paper proposes the Adaptive Social Learning (ASL) framework, which defines four hierarchical reasoning modes (ranging from intuitive response to deep prospective reasoning) and introduces the AMPO algorithm (combining mode-level and sample-level advantage estimation) to enable LLM agents to adaptively switch reasoning depth according to social scenario complexity. ASL outperforms GPT-4o by 15.6% on social intelligence tasks, surpasses GRPO by 7.0%, and reduces token consumption by 32.8%.
Agentified Assessment of Logical Reasoning Agents: This paper proposes an agent-based evaluation framework (AAA) that encapsulates assessment logic as an assessor agent and interacts with the agent under test via a standard A2A interface. On a FOLIO dataset systematically cleaned using the Vampire theorem prover, an auto-formalization agent (NL→Z3Py + SMT solving) achieves 86.70% accuracy, substantially outperforming the CoT baseline at 73.89%, with a particularly notable gain of 32.79 percentage points on contradiction detection (False class).
AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent: AgentMath proposes a tool-augmented agent framework that seamlessly integrates LLM reasoning with the computational precision of a code interpreter through automated data synthesis, multi-turn interactive reinforcement learning, and an efficient asynchronous training system. At the 30B-A3B scale, it achieves state-of-the-art performance on AIME24/25 and HMMT25 (90.6%/86.4%/73.8%), surpassing o3-mini and Claude-Opus-4.0-Thinking.
AIMCoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning: AIMCoT reframes visual information selection in multimodal CoT from "passively attending to high-attention regions" to "actively seeking regions of maximal information gain." Three collaborative modules — CAG (Context-enhanced Attention-map Generation), AVP (Active Visual Probing), and DAT (Dynamic Attention-shifting Trigger) — constitute a training-free, plug-and-play framework that outperforms ICoT by 18.25% on LLaVA-W (0-shot).
Annotation-Efficient Universal Honesty Alignment: This paper proposes EliCal (Elicit then Calibrate), a two-stage framework that first trains an LLM to express internal confidence using annotation-free self-consistency signals, then calibrates with a minimal number of correctness annotations (only 1K samples, 0.18% of the full set). On HonestyBench (560K training + 70K evaluation), EliCal achieves approximately 98% of the fully-supervised upper bound and generalizes better than calibration-only baselines on unseen MMLU tasks.
Are Reasoning LLMs Robust to Interventions on Their Chain-of-Thought?: This paper systematically evaluates the robustness of reasoning LLMs to various interventions (benign/neutral/adversarial) in their chain-of-thought. Models are generally robust and can recover from interventions; however, paraphrasing suppresses "self-doubt" expressions and degrades accuracy, while the recovery process incurs significant computational overhead (CoT length inflation up to 665%).
ATTS: Asynchronous Test-Time Scaling via Conformal Prediction: This paper proposes ATTS, an asynchronous test-time scaling framework based on conformal prediction that eliminates synchronization overhead by reformulating rejection sampling as a hypothesis testing procedure. On mathematical reasoning benchmarks such as MATH and AIME, ATTS achieves up to 56.7× speedup and 4.14× throughput improvement without accuracy loss. A 1.5B/70B draft/target model combination reaches the AIME performance level of o3-mini (high).
Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts: This paper proposes the Contact Searching Question (CSQ) framework, which leverages directed graph reachability tasks and cognitive psychology principles to design two complementary statistical metrics—deception intent score $\rho$ and deception behavior score $\delta$—systematically revealing, for the first time, that 16 mainstream LLMs exhibit spontaneous deception tendencies under entirely benign prompts, with deception escalating as task difficulty increases.
Compositional Generalization from Learned Skills via CoT Training: A Theoretical and Structural Analysis for Reasoning: This paper leverages information-theoretic generalization bounds and mechanistic interpretability to demonstrate that the core mechanism of CoT training is compositional generalization—the model learns to systematically compose previously acquired simple skills to solve novel complex problems, internalizing this ability as a two-stage compositional reasoning circuit that extracts intermediate results at shallower layers, freeing deeper layers to focus on subsequent reasoning steps.
Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors: This paper identifies the phenomenon of "logic inertia" in LLMs—whereby models continue along learned reasoning trajectories even when presented with contradictory premises, reducing accuracy to 0.0—and proposes the Conflict-Aware Fusion dual-process architecture, which enforces premise verification prior to reasoning execution, achieving 100% accuracy on contradiction detection.
Continuous Chain of Thought Enables Parallel Exploration and Reasoning: CoT2 proposes replacing discrete tokens with continuous-valued tokens (convex combinations of vocabulary embeddings) for chain-of-thought reasoning, enabling the model to track multiple reasoning paths in parallel within a single forward pass. The approach is theoretically shown to be equivalent to $K$ rounds of self-consistency/best-of-N sampling, and is further improved via GRPO-based reinforcement learning.
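The "continuous-valued token" in the CoT2 entry above is described as a convex combination of vocabulary embeddings. A minimal sketch of that construction follows, assuming a toy embedding table and softmax weights; the sizes and logits are made up for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab_size, dim = 5, 4                                # toy sizes
embeddings = rng.standard_normal((vocab_size, dim))   # toy embedding table

# Two candidates are almost equally likely; the rest are negligible.
logits = np.array([2.0, 1.9, -3.0, -3.0, -3.0])
weights = softmax(logits)                 # non-negative, sums to 1

# Continuous token: keeps both leading candidates in superposition.
continuous_token = weights @ embeddings

# Discrete decoding would commit to a single candidate instead.
discrete_token = embeddings[int(np.argmax(logits))]
```

Feeding `continuous_token` forward instead of `discrete_token` is what lets a single pass carry information about several reasoning paths at once.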
CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos: This paper proposes CoT-RVS, a fully training-free multi-agent framework that leverages the zero-shot CoT reasoning capabilities of pretrained MLLMs for temporal-semantic correlation analysis and keyframe selection, achieving substantial improvements over fine-tuned methods on reasoning video segmentation tasks (Refer-DAVIS J&F 79.1 vs. 71.2; ReasonVOS J&F 65.5 vs. 49.9).
CyclicReflex: Improving Reasoning Models via Cyclical Reflection Token Scheduling: This paper treats reflection tokens (e.g., "wait", "but") in the reasoning process as schedulable "resources" and, inspired by cyclical learning rate scheduling in optimization, proposes CyclicReflex — a training-free decoding strategy that dynamically modulates the logits of reflection tokens via a triangular waveform. CyclicReflex consistently improves the accuracy of 1.5B–8B models across multiple mathematical reasoning benchmarks (MATH500, AIME2024/2025, AMC2023).
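The triangular-waveform modulation in the CyclicReflex entry above can be sketched as a step-dependent bias added to the logits of reflection tokens. The period, amplitude, phase convention, and token ids below are hypothetical; the paper's exact schedule may differ.

```python
import numpy as np

def triangular_wave(step, period, amplitude):
    """Triangle wave in [-amplitude, +amplitude]; starts at the trough."""
    phase = (step % period) / period
    return amplitude * (1.0 - 4.0 * abs(phase - 0.5))

def modulate_logits(logits, reflection_ids, step, period=100, amplitude=2.0):
    """Return a copy of `logits` with a cyclic bias on reflection tokens only."""
    out = np.array(logits, dtype=float)
    out[reflection_ids] += triangular_wave(step, period, amplitude)
    return out

logits = np.zeros(6)
reflection_ids = [1, 4]   # hypothetical ids for tokens such as "wait", "but"
suppressed = modulate_logits(logits, reflection_ids, step=0)    # bias = -2
promoted = modulate_logits(logits, reflection_ids, step=50)     # bias = +2
```

Because the bias is applied only at decoding time, a schedule of this kind needs no retraining.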
DAG-Math: Graph-of-Thought Guided Mathematical Reasoning in LLMs: This work formalizes LLM chain-of-thought reasoning as a rule-based stochastic process over DAGs, proposes logical closeness as a metric to assess whether a model arrives at an answer through search or rigorous logical deduction, constructs a gold-standard DAG-MATH benchmark of 2,894 instances, and demonstrates that models with similar PASS@k scores can differ substantially in reasoning faithfulness.
DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning: This paper introduces Design Logic—reusable meta-knowledge reverse-engineered from real exam questions—to guide the synthesis of multidisciplinary reasoning problems from raw text. A dataset of 4.7 million questions spanning 75 disciplines is constructed, and base models fine-tuned via SFT on this data surpass their officially post-trained counterparts.
Doxing via the Lens: Revealing Location-related Privacy Leakage on Multi-modal Large Reasoning Models: This paper presents the first systematic study of privacy leakage risks arising from multi-modal large reasoning models (MLRMs) inferring sensitive geographic location information from user-generated images. It proposes a three-tier privacy risk framework, the DoxBench benchmark, the information-theoretic metric Glare, and the collaborative attack framework GeoMiner. The findings show that MLRMs surpass non-expert humans in geographic inference, significantly lowering the barrier for adversaries to obtain sensitive location information.
DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization: This paper diagnoses a fundamental flaw in GRPO with length penalties — correct but verbose responses may receive negative advantage values and thus be incorrectly penalized — and proposes DRPO, which decouples the reward signals for positive and negative samples to ensure length penalties are normalized only within the correct-response group. On a 1.5B model, DRPO achieves a 77% length reduction with only a 1.1% performance drop, compared to a 68% reduction with a 4.3% drop for the baseline.
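The flaw described in the DRPO entry above can be reproduced with a small numerical example: under group-normalized advantages, a correct but long response can end up below the group mean. The reward shaping below (a linear length penalty with coefficient 0.5) is an assumed stand-in, not the paper's reward.

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

correct = np.array([1.0, 1.0, 1.0, 0.0])       # last response is wrong
lengths = np.array([200, 300, 2000, 500])      # third response is verbose

# Hypothetical length penalty, applied to correct responses only.
penalty = 0.5 * lengths / lengths.max()
rewards = correct * (1.0 - penalty)            # [0.95, 0.925, 0.5, 0.0]

advantages = group_advantages(rewards)
# The third response is correct, yet its reward (0.5) is below the group
# mean (about 0.594), so its advantage is negative and it is pushed down.
```

Normalizing the length term only among correct responses, as the entry describes, is meant to avoid exactly this sign flip.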
Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models: This paper models each prompt's solve progress during RL finetuning as a latent Markov dynamical system, and employs lightweight Bayesian inference to predict prompt solve states online. By prioritizing "partially solved" prompts for sampling, the method achieves comparable or superior reasoning performance to Dynamic Sampling (DS) using fewer than 30% of DS's rollouts.
Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure: This paper models latent CoT as a structural causal model (SCM) and analyzes the Coconut and CODI paradigms via step-wise do-interventions, revealing that latent reasoning steps exhibit heterogeneous causal leverage, non-local jump-based propagation structures, and a persistent gap between early output commitment and late representational commitment.
Efficient Test-Time Scaling for Small Vision-Language Models: This paper proposes two efficient test-time scaling strategies for small VLMs: TTAug (applying diverse input augmentations and aggregating output probability distributions at the token level) and TTAdapt (adapting model parameters using pseudo-labels generated by TTAug). Both methods consistently improve performance across 9 benchmarks while achieving substantially better computational efficiency than existing sampling-based test-time scaling approaches.
Estimating the Empowerment of Language Model Agents: This paper proposes EELMA, an algorithm that leverages empowerment from information theory — defined as the mutual information between an agent's actions and future states — as a goal-agnostic capability metric for LM agents. EELMA achieves strong correlation with task performance ($r=0.83$–$0.94$) in both language games and real-world web navigation scenarios, and can be applied to open-ended agent monitoring and safety evaluation.
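The quantity named in the EELMA entry above, mutual information between actions and future states, can be computed exactly for a small discrete joint distribution. The sketch below shows only that textbook computation on made-up tables; it is not the EELMA estimator, which has to work from language-agent trajectories.

```python
import numpy as np

def mutual_information(joint):
    """I(A; S') in bits for a joint table with rows = actions, cols = states."""
    p = np.asarray(joint, dtype=float)
    p = p / p.sum()
    pa = p.sum(axis=1, keepdims=True)    # marginal over actions
    ps = p.sum(axis=0, keepdims=True)    # marginal over future states
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / (pa @ ps)[mask])).sum())

# Each action deterministically selects its own future state.
controllable = [[0.5, 0.0],
                [0.0, 0.5]]
# The future state ignores the action entirely.
powerless = [[0.25, 0.25],
             [0.25, 0.25]]

mi_controllable = mutual_information(controllable)   # 1 bit
mi_powerless = mutual_information(powerless)         # 0 bits
```

An agent whose actions reliably shape what happens next scores high on this measure regardless of which goal it is pursuing, which is why the entry calls it goal-agnostic.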
Evoking User Memory: Personalizing LLM via Recollection-Familiarity Adaptive Retrieval: Inspired by the dual-process theory in cognitive science, this paper proposes RF-Mem, a memory retrieval framework that achieves efficient and scalable LLM personalization through adaptive switching between two pathways: Familiarity (fast similarity matching) and Recollection (deep chain-based reconstruction).
FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning: Targeting the severe bottleneck in GRPO training where the generation phase consumes 91%–98% of total training time, this work proposes a concurrency-aware speculative decoding strategy (dynamically adjusting draft tree parameters to accommodate the real-time shift from high to low concurrency) and online draft model learning (continuously adapting to distribution drift using hidden states produced by the target model). The combined approach achieves 2.35×–2.72× end-to-end training speedup without degrading reasoning quality.
Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning: Fine-R1 combines CoT supervised fine-tuning (structured reasoning chains following "visual analysis → candidate sub-classes → comparison → prediction") with Triplet-Augmented Policy Optimization (TAPO)—intra-class augmentation for robustness and inter-class augmentation for discriminability—achieving superior performance over CLIP and general/reasoning MLLMs on fine-grained visual recognition using only 4-shot training.
Fixing the Broken Compass: Diagnosing and Improving Inference-Time Reward Modeling: This paper systematically diagnoses three failure modes of inference-time reward models (RMs)—performance degradation on easy problems, diminished discriminability as sample size increases, and accuracy loss under high search diversity—and proposes CRISP, an algorithm that mitigates these issues via answer-clustering-based reward aggregation and stepwise prefix guidance, achieving accuracy improvements of up to 5%.
From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics: This paper introduces ContextMATH, a benchmark that transforms abstract AIME/MATH-500 problems into two variants — Scenario Grounding (SG) and Complexity Scaling (CS) — and reveals that even top-tier models such as GPT-5 and DeepSeek-R1 suffer accuracy drops of 13–34% on contextual mathematical reasoning, with errors attributable primarily to problem formulation rather than computational reasoning.
From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics: This paper introduces the ContextMATH benchmark, which systematically converts abstract mathematical problems from AIME and MATH-500 into two contextual variants—Scenario Grounding (SG) and Complexity Scaling (CS)—to reveal substantial performance degradation in LLMs on contextual mathematical reasoning. Open-source models drop by 13% on average on SG and 34% on CS. Two complementary performance bottlenecks are identified: problem formulation and reasoning execution.
Generalizable End-to-End Tool-Use RL with Synthetic CodeGym: This paper proposes CodeGym, a framework that automatically converts programming problems into multi-turn interactive tool-use environments for reinforcement learning training of LLM agents, achieving significant out-of-distribution generalization gains (e.g., +8.7 points on τ-Bench for Qwen2.5-32B).
GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs: This paper formalizes the Program-to-Geometry task and proposes GeoGramBench (500 problems), evaluating 19 frontier LLMs on their ability to construct geometric representations from procedural drawing code and reason over them using a three-level geometric complexity taxonomy. Even GPT-5 achieves only 39.26% accuracy at the highest abstraction level, revealing fundamental limitations in LLM spatial abstraction.
Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation: This paper identifies that the advantage function in GRPO (std normalization) causes update magnitudes to peak at medium-difficulty problems while implicitly suppressing updates on both hard and easy problems. To address this, the authors propose MathForge — combining DGPO (replacing std with MAD for difficulty-balanced normalization + softmax difficulty weighting) and MQR (question reformulation via three aspects: narrative context, abstract terminology, and nested sub-problems, increasing difficulty while preserving original answers). On Qwen2.5-Math-7B, MathForge outperforms GRPO by an average of +4.56% across six mathematical reasoning benchmarks.
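The claim in the MathForge entry above, that std normalization makes update magnitudes peak at medium difficulty, can be checked numerically for binary rewards. The sketch assumes "MAD" means mean absolute deviation and uses the summed absolute advantage of a group as a proxy for update magnitude; both are assumptions, and the softmax difficulty weighting is not modeled.

```python
import numpy as np

def total_update_magnitude(num_correct, group_size, scale):
    """Sum of |advantage| over a group with binary rewards."""
    r = np.array([1.0] * num_correct + [0.0] * (group_size - num_correct))
    centered = r - r.mean()
    if scale == "std":
        denom = r.std()                    # GRPO-style normalization
    else:
        denom = np.abs(centered).mean()    # mean absolute deviation
    return float(np.abs(centered / denom).sum())

group_size = 8
counts = range(1, group_size)   # skip 0 and 8: all-equal rewards give 0/0
std_mag = [total_update_magnitude(k, group_size, "std") for k in counts]
mad_mag = [total_update_magnitude(k, group_size, "mad") for k in counts]
# std_mag rises to a peak at k = 4 (pass rate 0.5) and falls off on both sides;
# mad_mag is the same for every pass rate.
```

With pass rate $p$ and group size $G$, the std-normalized total is $2G\sqrt{p(1-p)}$, which is maximal at $p = 0.5$, whereas the MAD-normalized total is $G$ for every $p$.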
HeurekaBench: A Benchmarking Framework for AI Co-scientist: This paper proposes HeurekaBench, a framework for constructing evaluation benchmarks grounded in real scientific workflows. It employs a multi-LLM pipeline to extract verifiable scientific insights from papers and generate open-ended research questions, enabling end-to-end assessment of AI co-scientists in data-driven scientific discovery.
I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift: This paper systematically investigates the fragility of frozen-embedding-based safety classifiers under embedding drift induced by model updates. It finds that a mere 2% perturbation in the embedding space is sufficient to degrade classifier performance from 85% ROC-AUC to near-random levels (50%), with 72% of misclassifications occurring at high confidence (silent failure). Counterintuitively, instruction-tuned models prove harder to classify than their base counterparts.
Is In-Context Learning Learning?: This paper systematically investigates whether ICL constitutes genuine "learning" through large-scale controlled experiments. It demonstrates that ICL satisfies the formal mathematical definition of learning, yet empirical evidence reveals its generalization capacity to be limited — models primarily exploit structural regularities within the prompt via deduction rather than acquiring new capabilities from the provided demonstrations.
Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort: This paper proposes TRACE (Truncated Reasoning AUC Evaluation), a method that quantifies reasoning effort by progressively truncating chain-of-thought (CoT) reasoning and measuring how early a model can obtain reward. TRACE detects implicit reward hacking that CoT monitoring fails to identify, achieving detection F1 improvements of over 65% and 30% compared to the strongest CoT monitors on math and code tasks, respectively.
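The truncation idea in the TRACE entry above can be sketched as an area under the curve of reward versus the fraction of CoT retained. The curves below are invented for illustration, and how the reward is measured at each truncation point is left abstract.

```python
import numpy as np

def trace_auc(fractions, pass_rates):
    """Trapezoidal area under the reward-vs-retained-CoT curve."""
    x = np.asarray(fractions, dtype=float)
    y = np.asarray(pass_rates, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))

fractions = np.linspace(0.0, 1.0, 6)   # 0%, 20%, ..., 100% of the CoT kept

# Hypothetical curves: genuine reasoning needs most of the CoT to succeed,
# while a shortcut reaches the reward with almost no reasoning.
genuine = [0.0, 0.0, 0.1, 0.3, 0.7, 1.0]
shortcut = [0.9, 1.0, 1.0, 1.0, 1.0, 1.0]

auc_genuine = trace_auc(fractions, genuine)
auc_shortcut = trace_auc(fractions, shortcut)
# A high area means the reward arrives early, i.e. with little reasoning effort.
```

A detector built on this signal would flag responses whose area exceeds some threshold; choosing that threshold is outside the scope of this sketch.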
LingOly-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation: This paper introduces LingOly-TOO, a benchmark that applies expert-designed grapheme-level permutations to linguistics olympiad problems, preserving reasoning logic while eliminating knowledge and memorization shortcuts. The obfuscation reduces the top score across 15 frontier models from 0.59 to 0.48, systematically quantifying the extent to which LLM reasoning ability is overestimated due to knowledge effects.
mR3: Multilingual Rubric-Agnostic Reward Reasoning Models: This paper introduces mR3, a family of multilingual rubric-agnostic reward reasoning models covering 72 languages. Through systematic data construction (GPT-OSS-120B distillation with difficulty filtering) and curriculum learning, the 14B model surpasses the 120B teacher model and all comparable baselines on multilingual evaluation benchmarks, while supporting point-wise, pair-wise, and binary evaluation paradigms.
Native Reasoning Models: Training Language Models to Reason on Unverifiable Data: This paper proposes NRT (Native Reasoning Training), a framework that treats reasoning chains as latent variables and uses the model's own predictive confidence over reference answers as an intrinsic reward signal to train LLM reasoning—without external verifiers or expert reasoning demonstrations. On Llama-3.1-8B, NRT achieves an average improvement of 10.2 points across 9 benchmarks (46.0→56.2), surpassing the verifier-dependent RLPR by +5.4 points.
No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes: Prior to answer generation, a linear probe (difference-of-means) trained solely on residual stream activations at the question-processing stage can predict whether a model's forthcoming answer will be correct. This "pre-generation correctness direction," trained on TriviaQA, generalizes across multiple factual knowledge datasets (AUROC 0.68–0.88) but fails to generalize to mathematical reasoning (GSM8K), revealing a structural separation between representations of factual correctness and reasoning correctness within the model's internals.
Nudging the Boundaries of LLM Reasoning: This paper identifies a fundamental limitation of GRPO: it cannot learn from problems that the model completely fails to solve (pass rate = 0%), producing zero gradients. The proposed method, NuRL, addresses this by injecting self-generated abstract hints (without revealing answers) into hard problems during training, converting them into learnable samples. NuRL consistently outperforms GRPO across 3 models and 6 benchmarks, and genuinely improves the pass@k capability upper bound.
On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning: This paper proposes the Regularized Policy Gradient (RPG) framework, which systematically derives and analyzes policy gradient methods based on Forward/Reverse KL divergence (in both normalized and unnormalized forms). It identifies a theoretical inconsistency in GRPO's KL term and achieves superior performance over GRPO, REINFORCE++, and DAPO on mathematical reasoning benchmarks.
On The Fragility of Benchmark Contamination Detection in Reasoning Models: This systematic study reveals that benchmark contamination detection in large reasoning models (LRMs) is extremely fragile: contamination introduced during the SFT stage becomes nearly undetectable after GRPO training (with PPO-style importance sampling and clipping identified as the root cause), and direct CoT SFT contamination of advanced LRMs leaves virtually no detectable trace—all 10 evaluated detection methods perform close to random guessing.
Plan and Budget: Effective and Efficient Test-Time Scaling on Reasoning LLMs: This paper proposes the Plan-and-Budget framework, which decomposes complex queries into sub-problems and adaptively allocates token budgets based on estimated complexity. It achieves efficient test-time scaling for reasoning LLMs, with up to 70% accuracy improvement, 39% token reduction, and a 193.8% gain on the E3 metric.
PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation: This work is the first to integrate decomposed Chain-of-Thought reasoning with multi-dimensional reinforcement learning (RL) for video-to-audio (V2A) generation. It addresses the objective entanglement problem via four specialized CoT modules (semantic/temporal/aesthetic/spatial) paired with corresponding reward functions, and proposes the Fast-GRPO algorithm to substantially reduce RL training cost.
RAIN-Merging: A Gradient-Free Method to Enhance Instruction Following in Large Reasoning Models with Preserved Thinking Format: To address the tension between strong reasoning capability and weak instruction following in large reasoning models (LRMs), this paper proposes RAIN-Merging, a two-stage gradient-free merging pipeline that preserves the thinking format via null-space projection and enhances instruction relevance via attention-guided per-module scaling coefficients. It integrates the capabilities of an instruction-tuned model (ITM) into an LRM without any gradient-based training, achieving consistent improvements across 4 instruction-following and 9 reasoning benchmarks.
RAIN-Merging: A Gradient-Free Method to Enhance Instruction Following Through Model Merging: This paper proposes RAIN-Merging, a gradient-free two-stage model merging method: it first applies null-space projection to preserve the thinking format of Large Reasoning Models (LRMs), then employs instruction-attention-guided merging coefficients to enhance instruction following, simultaneously improving instruction compliance and reasoning quality.
Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models: This work presents the first systematic study of answer attribution in large reasoning models (LRMs), revealing that reasoning (CoT) and retrieval (memory) mechanisms compete simultaneously to influence final answers. The paper proposes Farl (Forgetting-Augmented Reinforcement Learning), which suppresses retrieval shortcuts to enhance genuine reasoning capability.
ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization: This paper proposes ReForm, a reflective autoformalization paradigm that transforms the process of converting natural-language mathematics problems into Lean formal statements from single-pass generation into an iterative "generate → semantic self-verify → correct" loop. It further introduces the PBSO algorithm to optimize heterogeneous reward signals, achieving an average improvement of 22.6 percentage points over the strongest baselines across four benchmarks.
RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models: This paper proposes a formal definition of Reasoning Faithfulness (RF) decomposed into stance consistency and causal influence, constructs the RFEval benchmark comprising 7,186 instances across 7 tasks, and evaluates 12 open-source Large Reasoning Models (LRMs) via output-level counterfactual reasoning intervention. Key findings include: 49.7% of outputs are unfaithful, RL post-training degrades faithfulness, and task accuracy is not a reliable proxy for faithfulness.
Scaling Generalist Data-Analytic Agents: This paper proposes DataMind, a complete training framework for data-analytic agents. It synthesizes diverse queries via a fine-grained task taxonomy with recursive difficulty composition, ensures data quality through knowledge-augmented trajectory sampling and self-consistency filtering, employs a dynamic SFT+RL mixed training strategy, and implements a memory-efficient asynchronous rollout framework. The resulting DataMind-14B achieves a 71.16% average score across multiple benchmarks, establishing a new state of the art and surpassing GPT-5 and DeepSeek-V3.1.
SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes: This paper proposes SceneCOT, the first framework to introduce Chain-of-Thought reasoning into 3D scene understanding. Through a four-stage reasoning pipeline (task recognition → region localization → entity grounding → grounded reasoning), intermediate reasoning steps are explicitly linked to visual grounding. SceneCOT achieves 34.7% Good Coherence on Beacon3D, a relative gain of about 70% over the strongest baseline (20.4%).
SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models: This paper introduces SealQA, a challenging benchmark with three variants (Seal-0/Seal-Hard/LongSeal), where each question is carefully crafted by NLP researchers to trigger ambiguous, conflicting, or noisy search results. Even GPT-5 achieves at most 43.2% accuracy, revealing that test-time scaling does not yield reliable gains under noisy retrieval conditions.
Segment-Level Attribution for Selective Learning of Long Reasoning Traces: This paper applies Integrated Gradients to compute the attribution strength and direction consistency of each segment in long reasoning traces with respect to the final answer, identifies important segments for selective SFT, and achieves up to 4.7% accuracy improvement over full-CoT training while reducing output length by 18%.
Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning: This paper proposes TAMPO (Temperature Adaptive Meta Policy Optimization), which reframes the sampling temperature as a learnable meta-policy. Through a bilevel loop — an inner loop for LLM policy optimization and an outer loop for adaptively updating the temperature distribution based on trajectory advantage signals — TAMPO requires no additional rollouts and consistently outperforms fixed-temperature baselines on mathematical reasoning benchmarks.
The First Impression Problem: Internal Bias Triggers Overthinking in Reasoning Models: Reasoning models form a "first impression" (internal bias) about the answer the moment they receive a question. When this intuitive guess conflicts with the subsequent systematic reasoning process, the model repeatedly second-guesses itself and re-examines its work, causing reasoning length to inflate by 21%–43%. Critically, none of the existing mitigation methods can fundamentally eliminate this effect.
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs: This paper reveals that short-task benchmarks create an illusion of diminishing returns — marginal gains in per-step accuracy are amplified exponentially in long-horizon tasks. It identifies a "self-conditioning effect" in LLMs (whereby prior errors increase the probability of subsequent errors), shows that thinking models mitigate this effect, and demonstrates that GPT-5 thinking can execute tasks exceeding 2,100 steps.
The Path of Least Resistance: Guiding LLM Reasoning Trajectories with Prefix Consensus: This paper proposes PoLR (Path of Least Resistance), the first inference-time method that exploits prefix consensus in reasoning chains. By clustering short prefixes and expanding only the dominant cluster, PoLR replaces standard Self-Consistency while maintaining or improving accuracy on GSM8K, Math500, AIME, and GPQA, with 40%–60% reduction in token usage and up to 50% lower latency.
Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs: This paper proposes AdaAnchor, a latent-space reasoning framework that appends learnable anchor vectors to input embeddings and refines their states through iterative forward passes to achieve "silent thinking." An adaptive stopping mechanism based on anchor stability dynamically allocates computation according to instance difficulty. On mathematical reasoning benchmarks, AdaAnchor achieves up to 5% higher accuracy and 48–60% fewer average steps compared to fixed-step latent reasoning, while reducing output tokens by 92–93% relative to CoT.
TopoBench: Benchmarking LLMs on Hard Topological Reasoning: TopoBench is a benchmark comprising 6 categories of topological puzzles × 3 difficulty levels for evaluating the global spatial reasoning capabilities of LLMs. Frontier models solve fewer than 24% of hard-tier instances. Causal intervention experiments reveal that error frequency does not equal causal impact — low-frequency constraint forgetting is more destructive than high-frequency repetitive reasoning.
Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention: This paper proposes Intervened Preference Optimization (IPO), which constructs preference pairs for training by replacing compliance cues with safety triggers at critical steps during the reasoning process, significantly improving the safety of the chain-of-thought (CoT) reasoning process itself in large reasoning models (LRMs).
Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention: This paper identifies a critical yet overlooked problem in large reasoning models (LRMs): their chain-of-thought reasoning frequently contains harmful content even when the final response appears safe. The authors propose Intervened Preference Optimization (IPO), which corrects unsafe reasoning trajectories by replacing compliance cues with safety triggers, constructing preference pairs for alignment training. Across 3 LRMs, IPO reduces reasoning harmfulness by over 30% without compromising reasoning capability.
Training Large Reasoning Models Efficiently via Progressive Thought Encoding: This paper proposes Progressive Thought Encoding, which encodes evicted token information into fixed-size LoRA weight updates whenever KV cache entries are evicted, enabling efficient RL training of large reasoning models under constrained cache budgets while preserving long-range reasoning capability.
Training Large Reasoning Models Efficiently via Progressive Thought Encoding: This paper proposes Progressive Thought Encoding, which encodes evicted thought tokens into LoRA weights under KV cache constraints, halving GPU memory usage during RL training of large reasoning models while surpassing full-cache LoRA in reasoning accuracy (up to +23.4% on AIME2024/2025).
TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis: This paper proposes TumorChain, an interleaved multimodal chain-of-thought reasoning framework for tumor analysis across five major digestive organs. It integrates a knowledge graph-driven 1.5M CoT-VQA data engine, organ-guided iterative interleaved reasoning (IIR), and joint optimization of segmentation, classification, and LLM models to realize a complete reasoning chain from imaging findings → clinical impressions → pathological predictions, achieving an average accuracy of 84.41% and substantially outperforming GPT-5-Mini (51.59%).
Understanding the Role of Training Data in Test-Time Scaling: This paper theoretically analyzes how training data properties affect test-time scaling. It proves that CoT reasoning is equivalent to pseudo-Newton method iteration, proposes a task hardness measure based on the minimum eigenvalue of feature covariance, and reveals the mechanism behind the "more thinking is not always better" overthinking phenomenon. It also derives an optimal task selection strategy for multi-task training: training sets should be diverse, relevant, and difficult.
Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision: This paper proposes Uni-CoT, a hierarchical macro-micro reasoning framework that decomposes multimodal CoT into macro-level task planning (decomposing complex tasks into sub-goals) and micro-level sub-task execution (MDP-style self-reflective iterative refinement). An attention mask design reduces the complexity from $O(T^2)$ to $O(T)$. The method surpasses the BAGEL baseline by 0.02 on GenEval, achieving unified reasoning over interleaved text and images.
Verifying Chain-of-Thought Reasoning via Its Computational Graph: This paper proposes CRV (Circuit-based Reasoning Verification), which constructs interpretable attribution graphs by replacing LLM MLPs with transcoders, extracts structural "fingerprints" of reasoning errors from these graphs, and enables white-box CoT reasoning verification with the capacity to correct erroneous reasoning via causal intervention.
When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models: This paper presents a systematic benchmark and mechanistic analysis of the effects of compression (quantization/distillation/pruning) on large reasoning models (LRMs), yielding three core findings: parameter count affects knowledge memorization more than reasoning ability; the last-layer MLP up_proj of distilled models is the most critical weight; protecting only 2% of over-compressed weights improves average accuracy by 6.57%.
When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models: This paper systematically studies the effects of three compression methods—quantization, distillation, and pruning—on Large Reasoning Models (LRMs) through performance benchmarking and mechanistic interpretability analysis. Key findings include: parameter count affects knowledge memorization more than reasoning ability; the last-layer MLP up_proj is the most critical component; and current quantization methods over-compress the final layers.
When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning: This paper systematically analyzes the latent reasoning behavior of Qwen2.5-Math-7B on GSM8K, finding that 81.6% of correct predictions arise from computationally inconsistent paths, 8.8% constitute silent failures (high-confidence errors), and revealing a paradoxical relationship between reasoning depth and accuracy.
Why is Your Language Model a Poor Implicit Reward Model?: This paper provides theoretical and empirical evidence that implicit reward models (IM-RM, e.g., DPO) generalize worse than explicit reward models (EX-RM) because IM-RM overfits to surface-level token cues rather than semantic representations, leading to substantial accuracy degradation under token distribution shift. The paper also refutes the "generation–verification gap" hypothesis.
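
The GRPO limitation noted in the "Nudging the Boundaries of LLM Reasoning" entry above can be illustrated with a few lines of arithmetic. The sketch below assumes the commonly described group-relative advantage (reward minus group mean, divided by group standard deviation); it is not code from the NuRL paper, and the function name `group_relative_advantages` and the `eps` constant are hypothetical choices for illustration. When every rollout for a problem receives the same reward (for example, a 0% pass rate), all advantages are zero, so that problem contributes no policy-gradient signal.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Hypothetical helper for illustration; assumes one scalar reward per rollout.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# A problem the model never solves: every rollout gets reward 0,
# so every advantage is 0 and the sample carries no learning signal.
unsolved = group_relative_advantages([0, 0, 0, 0])

# A problem solved in 1 of 4 rollouts: the successful rollout gets a
# positive advantage, the failed ones a negative advantage.
partly_solved = group_relative_advantages([0, 0, 1, 0])

print("unsolved:", unsolved)
print("partly solved:", partly_solved)
```

Under this assumption, turning a never-solved problem into a sometimes-solved one (for instance by adding a hint, as the entry describes) is what makes the advantages non-zero.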

🧊 3D Vision

3DGEER: 3D Gaussian Rendering Made Exact and Efficient for Generic Cameras: This paper proposes 3DGEER, a framework that derives a closed-form solution for integrating Gaussian density along rays, designs a Particle Bounding Frustum (PBF) for accurate and efficient ray–particle association, and introduces Bipolar Equal-Angle Projection (BEAP) to unify wide-FoV camera representations. 3DGEER achieves geometrically exact and real-time efficient 3D Gaussian rendering under arbitrary camera models, outperforming existing methods comprehensively on both fisheye and pinhole datasets.
A Genetic Algorithm for Navigating Synthesizable Molecular Spaces: This paper proposes SynGA, a genetic algorithm that operates directly on synthesis routes (synthesis trees), constraining the search strictly within synthesizable molecular space via custom crossover and mutation operators. Combined with ML-guided building block filtering, SynGA achieves state-of-the-art performance on synthesizable analog search and property optimization.
A Step to Decouple Optimization in 3DGS: This paper provides an in-depth analysis of two overlooked coupling issues in 3DGS optimization — update-step coupling (implicit updates and momentum rescaling for invisible viewpoints) and gradient coupling (entanglement of regularization and photometric loss in Adam momentum) — and proposes AdamW-GS by decoupling and recombining these components, simultaneously improving reconstruction quality and reducing redundant primitives without additional pruning operations.
Augmented Radiance Field: A General Framework for Enhanced Gaussian Splatting: This paper proposes the Augmented Radiance Field (ARF) framework, which explicitly models specular components by designing augmented Gaussian kernels with view-dependent opacity. An error-driven compensation strategy is introduced (2D Gaussian initialization → inverse projection to 3D → joint optimization) to enhance existing 3DGS scenes as a plug-and-play post-processing step. The method surpasses state-of-the-art NeRF approaches on multiple benchmarks while requiring only second-order spherical harmonics to capture complex illumination.
Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer: This paper proposes Brain-IT, a framework that employs a brain-inspired Brain Interaction Transformer (BIT) to cluster functionally similar brain voxels into cross-subject shared Brain Tokens, from which localized semantic and structural image features are predicted, enabling high-fidelity reconstruction of images from fMRI signals. With only 1 hour of data, Brain-IT achieves performance comparable to prior methods trained on 40 hours of data.
CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions: CloDS proposes the first framework for unsupervised cloth dynamics learning from multi-view videos. By introducing Spatial Mapping Gaussian Splatting (SMGS) to establish a differentiable mapping between 2D images and 3D meshes, combined with dual-position opacity modulation to address self-occlusion, the method enables a GNN to learn cloth dynamics approaching fully supervised performance without any physical parameter supervision.
Color3D: Controllable and Consistent 3D Colorization with Personalized Colorizer: Color3D introduces a paradigm of "colorize one key view → fine-tune a personalized colorizer → propagate colors to all views and timesteps," reducing the complex 3D colorization problem to single-image colorization plus color propagation. It achieves rich colorization, cross-view consistency, and user controllability simultaneously on both static and dynamic 3D scenes.
COOPERTRIM: Adaptive Data Selection for Uncertainty-Aware Cooperative Perception: CooperTrim is an adaptive feature selection framework that evaluates feature relevance via conformal temporal uncertainty estimation and dynamically determines the sharing volume through a data-driven mechanism. It achieves 80.28% bandwidth reduction with comparable performance on cooperative semantic segmentation, and is the first to apply selective sharing to cooperative segmentation tasks.
CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D: This paper proposes CORE-3D, a training-free open-vocabulary 3D semantic segmentation and natural language object retrieval pipeline that achieves state-of-the-art performance on Replica and ScanNet through progressive multi-granularity mask generation, context-aware CLIP encoding, and multi-view 3D fusion.
CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives: This paper proposes CRISP, a method for recovering simulatable human motion and scene geometry from monocular video. By fitting planar primitives to obtain clean, simulation-ready geometry and leveraging human-scene contact modeling to reconstruct occluded regions, CRISP reduces the motion tracking failure rate of a humanoid controller from 55.2% to 6.9%.
Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation: This paper proposes Ctrl&Shift, an end-to-end diffusion framework that decomposes object manipulation into object removal and reference-guided inpainting, and injects relative camera pose control, achieving geometry-consistent fine-grained object manipulation for the first time without relying on explicit 3D reconstruction.
D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping: D-REX is a Gaussian-based differentiable real-to-sim-to-real engine that identifies object mass end-to-end from visual observations and robot control signals, then uses the identified mass to learn force-aware dexterous grasping policies, effectively bridging the sim-to-real gap.
DiffWind: Physics-Informed Differentiable Modeling of Wind-Driven Object Dynamics: This paper proposes DiffWind, a physics-constrained differentiable framework that models wind as a grid-based physical field, represents objects as a 3D Gaussian Splatting particle system, simulates wind–object interaction via the Material Point Method (MPM), and incorporates the Lattice Boltzmann Method (LBM) as a physical constraint. The framework jointly reconstructs wind force fields and object motion from video, supports forward simulation under novel wind conditions and wind retargeting, and significantly outperforms existing dynamic scene modeling methods on the newly introduced WD-Objects dataset.
Dynamic Novel View Synthesis in High Dynamic Range: This paper is the first to formally define the HDR Dynamic Novel View Synthesis (HDR DNVS) problem and proposes the HDR-4DGS framework. Through a dynamic tone mapping module, the framework achieves temporally consistent HDR radiance field reconstruction in time-varying scenes, outperforming existing methods on both synthetic and real-world datasets.
Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention: This paper proposes Efficient-LVSM, a dual-stream architecture that decouples input view encoding from target view generation, reducing the complexity of novel view synthesis from $O(N_{in}^2)$ to $O(N_{in})$. On RealEstate10K, the model achieves state-of-the-art performance (29.86 dB PSNR) using only 50% of LVSM's training time, with a 4.4× inference speedup.
EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark: This paper introduces EgoNight, the first systematic nighttime egocentric vision benchmark, comprising day-night aligned videos and 3,658 manually verified QA pairs. It reveals that MLLMs suffer up to 32.8% performance degradation under low-light conditions.
EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations: EgoWorld proposes an end-to-end exocentric-to-egocentric view translation framework that extracts three complementary observations from a single third-person image—3D point clouds, hand poses, and text descriptions—projects the point cloud to obtain a sparse egocentric RGB map, and reconstructs a complete high-fidelity egocentric image via diffusion-based inpainting, achieving state-of-the-art performance across four datasets under diverse unseen settings.
Einstein Fields: A Neural Perspective To Computational General Relativity: This paper proposes EinFields, the first framework to apply neural implicit representations to the compression of four-dimensional general relativity simulations. By encoding the metric tensor field as compact neural network weights, it achieves 4000× storage compression and 5–7 digits of numerical precision, while tensor derivatives obtained via automatic differentiation are 5 orders of magnitude more accurate than those from finite differences.
Fast Estimation of Wasserstein Distances via Regression on Sliced Wasserstein Distances: Leveraging the mathematical property that standard Sliced Wasserstein (SW) distances provide lower bounds and lifted SW distances provide upper bounds for the Wasserstein distance, this paper constructs a minimal linear regression model (the RG framework) that estimates Wasserstein distances with high accuracy using only a small number of exact Wasserstein labels as supervision, comprehensively outperforming the Transformer-based method Wasserstein Wormhole in low-data regimes.
FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation: This paper proposes FastGHA, a feed-forward few-shot 3D Gaussian head avatar generation framework that reconstructs an animatable 3D Gaussian head from 4 arbitrary-expression/viewpoint input images in ~1 second, supporting real-time animation at 62 FPS. On Ava-256, it achieves a PSNR of 22.5 dB, surpassing Avat3r's 20.7 dB while being 7.75× faster.
Fused-Planes: Why Train a Thousand Tri-Planes When You Can Share?: This paper proposes Fused-Planes, which decomposes the Tri-Plane representation into shared class-level basis planes (macro) and object-specific detail planes (micro) via a macro-micro decomposition. Combined with latent-space rendering, the method achieves 7× training speedup and 3× memory reduction while maintaining or surpassing the reconstruction quality of independently trained Tri-Planes.
Generalizable Coarse-to-Fine Robot Manipulation via Language-Aligned 3D Keypoints: CLAP (Coarse-to-fine Language-Aligned manipulation Policy) achieves strong generalization to novel instructions and unseen environments through three core components: task decomposition, VLM fine-tuning for 3D keypoint prediction, and 3D-aware representation. It outperforms the state of the art by 12% on GemBench using only 1/5 of the training data.
GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation: GeoPurify purifies noisy features projected from 2D VLMs into 3D by distilling geometric priors from a 3D self-supervised teacher model. Using only ~1.5% of the training data, it matches or exceeds full-data SOTA methods on open-vocabulary 3D segmentation.
GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra: This work introduces the GIQ benchmark, comprising 224 synthetic and real polyhedra, and systematically evaluates the geometric reasoning capabilities of vision foundation models across four tasks—monocular 3D reconstruction, symmetry detection, mental rotation testing, and zero-shot classification—revealing significant deficiencies in the geometric understanding of current models.
HDR-NSFF: High Dynamic Range Neural Scene Flow Fields: This paper proposes HDR-NSFF, which shifts HDR video reconstruction from the conventional 2D pixel-level fusion paradigm to 4D spatiotemporal modeling. From alternating-exposure monocular videos, it jointly reconstructs HDR radiance fields, 3D scene flow, geometry, and tone-mapping, enabling temporally and spatially consistent dynamic HDR novel view synthesis.
Improving Long-Range Interactions in Graph Neural Simulators via Hamiltonian Dynamics: This paper proposes Information-preserving Graph Neural Simulators (IGNS), which leverage port-Hamiltonian dynamical structure to prevent information dissipation on graphs. Combined with warmup initialization, geometric encoding, and multi-step training objectives, IGNS consistently outperforms existing graph neural simulators across 6 physics simulation benchmarks.
Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry: This paper trains a sparse autoencoder (SAE) on DINOv2 to extract a dictionary of 32,000 visual concepts, systematically investigates how different downstream tasks (classification / segmentation / depth estimation) selectively recruit subsets of these concepts, reveals that the geometry of the representation space goes beyond the Linear Representation Hypothesis (LRH), and proposes a novel Minkowski Representation Hypothesis (MRH) positing that tokens are superpositions of multiple convex mixtures.
Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry: By training a 32,000-unit Sparse Autoencoder dictionary on DINOv2, this work systematically analyzes how downstream tasks recruit distinct concepts, reveals that representational geometry deviates from the Linear Representation Hypothesis (LRH), and proposes the Minkowski Representation Hypothesis (MRH), which posits that token representations are Minkowski sums of multiple convex polytopes, with concepts defined by proximity to prototype points rather than linear directions.
Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps: This paper proposes Light-Geometry Interaction (LGI) maps, a 2.5D representation encoding light-occlusion relationships derived from monocular depth estimation. Embedded within a bridge matching generative framework, LGI maps enable joint modeling of shadow generation and object relighting, achieving state-of-the-art performance on both synthetic and real images.
LaVCa: LLM-assisted Visual Cortex Captioning: This paper proposes LaVCa, a method that leverages LLMs to generate natural language captions for individual voxels in the human visual cortex. Through a four-step pipeline—encoding model construction, optimal image selection, MLLM-based captioning, and LLM-driven keyword extraction with sentence composition—LaVCa reveals voxel-level visual selectivity more accurately and with greater semantic diversity than the prior method BrainSCUBA.
Learning Part-Aware Dense 3D Feature Field for Generalizable Articulated Object Manipulation: This paper proposes PA3FF (Part-Aware 3D Feature Field), a natively 3D dense part-aware feature representation. By combining a Sonata pre-trained backbone with geometric and semantic contrastive learning, PA3FF yields zero-shot part-level features. Paired with a Part-Aware Diffusion Policy (PADP), the system achieves few-shot, highly generalizable articulated object manipulation, substantially outperforming baselines such as CLIP, DINOv2, and GenDP in both simulation and real-world settings.
Learning Physics-Grounded 4D Dynamics with Neural Gaussian Force Fields: This paper proposes the NGFF framework, which reconstructs 3D Gaussian representations from multi-view RGB images and learns explicit neural force fields to drive physics-based dynamics. By solving ODEs, the framework enables interactive, physically plausible 4D video generation that is two orders of magnitude faster than traditional Gaussian simulators, surpassing Veo3 and NVIDIA Cosmos.
Learning Unified Representation of 3D Gaussian Splatting: The native 3DGS parameters $\boldsymbol{\theta}=\{\mu,\mathbf{q},\mathbf{s},\mathbf{c},o\}$ suffer from non-uniqueness and numerical heterogeneity, making them unsuitable as a learning space for neural networks. This paper proposes the Submanifold Field (SF) representation: each Gaussian primitive is mapped to a continuous color field defined on its iso-probability ellipsoidal surface. The paper proves this mapping is injective, fundamentally eliminating parameter ambiguity. Combined with a VAE trained using an optimal-transport-based Manifold Distance (M-Dist), the approach comprehensively outperforms parameter-based baselines in reconstruction fidelity, cross-domain generalization, and latent space stability.
LiTo: Surface Light Field Tokenization: LiTo encodes surface light fields into compact sets of latent vectors to jointly model 3D geometry and view-dependent appearance: random subsampling of light field observations from RGB-D multi-view images → Perceiver IO encoder (3D local attention supporting 1M token input) + flow-matching geometry decoder + higher-order spherical harmonic Gaussian decoder → achieves reconstruction and single-image-to-3D generation surpassing TRELLIS, and for the first time models view-dependent effects such as specular highlights and Fresnel reflectance within a latent 3D representation.
MEGS2: Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning: This paper proposes MEGS2, a method that compresses 3DGS from the perspective of rendering VRAM: it replaces spherical harmonics (SH) entirely with prunable, arbitrarily oriented spherical Gaussians (SG) to reduce per-primitive parameter count, and formulates the joint pruning of primitive count and lobe count as a single memory-constrained optimization problem via a unified soft pruning framework. The result is an 8× reduction in static VRAM and a 6× reduction in rendering VRAM with preserved rendering quality, enabling real-time 3DGS on mobile devices for the first time.
Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos: This work is the first to address the problem of reconstructing renderable 4D HDR scenes from pose-free alternating-exposure monocular videos. Through a two-stage optimization pipeline (orthographic video space → world space), a Video-to-World Gaussian transformation strategy, and temporal luminance regularization, the method achieves 37.64 dB HDR PSNR and 161 FPS on synthetic data, comprehensively outperforming existing approaches.
MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models: This paper presents MultiMat, the first framework to apply large multimodal models (LMMs) to procedural material node graph synthesis. By incorporating intermediate visual rendering feedback of partially generated nodes into the autoregressive generation process (via two conditioning modes: mixed and graph), and pairing this with an incremental constrained tree search for on-the-fly validation and backtracking, MultiMat is trained on 6,878 production-grade Substance Designer materials and substantially outperforms text-only baselines in both unconditional and conditional generation.
NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction: This paper proposes NOVA3R — a non-pixel-aligned amodal 3D reconstruction framework from pose-free images. It employs learnable scene tokens to aggregate global information across views and a flow-matching-based diffusion 3D decoder to generate complete point clouds (including occluded regions). The method addresses two fundamental limitations of pixel-aligned approaches — inability to reconstruct occluded surfaces and redundant geometry in overlapping regions — and outperforms prior SOTA on scene-level and object-level benchmarks including SCRREAM and GSO.
Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images: This paper presents Omni-View, a unified 3D scene understanding and generation model that enhances understanding performance through a texture module (novel view synthesis) and a geometry module (depth/pose estimation), achieving a score of 55.4 on VSI-Bench and surpassing all existing specialized 3D understanding models.
One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image: One2Scene proposes a three-stage framework that decomposes single-image explorable 3D scene generation into: panorama generation → feed-forward 3D Gaussian splatting for geometric scaffold construction → scaffold-guided novel view synthesis. By reformulating panoramic depth estimation as a multi-view stereo matching problem, the method achieves geometrically consistent and freely explorable 3D scene generation.
One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image: This paper proposes One2Scene, which decomposes the ill-posed problem of generating an explorable 3D scene from a single image into three sub-tasks: (1) panorama generation to expand visual coverage, (2) a feed-forward 3DGS network that constructs an explicit 3D geometric scaffold from sparse anchor views, and (3) scaffold-guided novel view synthesis via Dual-LoRA that fuses high-quality anchor views with geometric priors. The method achieves geometrically consistent and photorealistic scene generation under large viewpoint changes, significantly outperforming state-of-the-art methods.
OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation: This paper presents OpenFly, a comprehensive platform for aerial vision-language navigation (VLN) that integrates four rendering engines (UE / GTA V / Google Earth / 3DGS), develops a fully automated data generation pipeline (point cloud acquisition → semantic segmentation → trajectory generation → GPT-4o instruction synthesis), constructs a large-scale dataset of 100K trajectories across 18 scenes, and proposes a keyframe-aware VLN model (OpenFly-Agent) combining keyframe selection with visual token merging. OpenFly-Agent outperforms existing methods by 14.0% and 7.9% in success rate on seen and unseen scenes, respectively.
PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data: This paper proposes PartSAM, the first promptable part segmentation model trained on large-scale native 3D data. It employs a triplane dual-branch encoder (frozen SAM priors + learnable 3D branch) and a SAM-style decoder. A model-in-the-loop annotation pipeline is used to construct 5M+ shape–part pairs. Under open-world settings, a single click from PartSAM outperforms Point-SAM by over 90% in IoU@1.
PD²GS: Part-Level Decoupling and Continuous Deformation of Articulated Objects via Gaussian Splatting: PD²GS learns a shared canonical Gaussian field and models each interaction state as a continuous deformation of it. Coarse-to-fine motion trajectory clustering and SAM-guided boundary refinement then enable part-level decoupling, reconstruction, and continuous control of articulated objects without any manual supervision.
Peering into the Unknown: Active View Selection with Neural Uncertainty Maps for 3D Reconstruction: This paper proposes PUN (Peering into the UnkNowN), which employs a lightweight feed-forward network, UPNet, to directly predict the uncertainty distribution over all candidate viewpoints on a sphere from a single image — termed a neural uncertainty map (UMap) — thereby replacing the conventional iterative active view selection pipeline that requires repeated retraining of NeRF or 3DGS models. PUN achieves comparable reconstruction quality using only half the viewpoints of the upper bound, while delivering a 400× speedup and over 50% reduction in computational resource consumption during view selection.
pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning: pySpatial is a visual programming framework that enables MLLMs to generate Python code that automatically invokes 3D spatial tools (3D reconstruction, camera pose estimation, novel view synthesis, etc.), transforming limited 2D image inputs into interactively explorable 3D scenes. The framework achieves zero-shot, plug-and-play explicit 3D spatial reasoning, attaining an overall accuracy of 58.56% on the MindCube benchmark—surpassing GPT-4.1-mini by 12.94% and VLM-3R by 16.5%—while also successfully driving a real quadruped robot to perform indoor navigation.
QuadGPT: Native Quadrilateral Mesh Generation with Autoregressive Models: This paper presents QuadGPT, the first end-to-end autoregressive framework for native quadrilateral mesh generation. Using unified mixed-topology tokenization (padding triangular faces into 4-vertex blocks), an Hourglass Transformer architecture, and topology-reward-based truncated DPO (tDPO) fine-tuning, it outperforms existing triangle-to-quad conversion pipelines and cross-field-guided methods in Chamfer Distance, Hausdorff Distance, quad ratio, and user preference.
Quantized Visual Geometry Grounded Transformer: To address the deployment demands of the billion-scale 3D reconstruction model VGGT, this paper proposes QuantVGGT, the first dedicated PTQ framework for VGGT. It resolves heavy-tailed activation distributions caused by special tokens via dual-smoothed fine-grained quantization (Hadamard rotation + channel-wise smoothing), and addresses calibration instability via noise-filtered diverse sampling. At 4-bit quantization, the method achieves 3.7× memory compression and 2.5× inference speedup while retaining 98%+ accuracy.
RadioGS: Radiometrically Consistent Gaussian Surfels for Inverse Rendering: RadioGS introduces a radiometric consistency loss that minimizes the residual between the learned radiance of each Gaussian surfel and its physically rendered radiance, providing physics-based supervision for unobserved directions. This forms a self-correcting feedback loop that enables accurate indirect illumination and material decomposition, while supporting relighting in minutes.
Scaling Sequence-to-Sequence Generative Neural Rendering: This paper presents Kaleido, a family of decoder-only rectified flow transformers that treats 3D as a special subdomain of video. Through Unified Positional Encoding, a masked autoregressive framework, and a video pretraining strategy, Kaleido achieves "any-to-any" 6-DoF novel view synthesis without any explicit 3D representation. It is the first generative method to match per-scene optimization (InstantNGP) in rendering quality under multi-view settings, and scales resolution from 512/576px to 1024px.
SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation: SceneTransporter reformulates open-world structured 3D scene generation as a global correspondence assignment problem by introducing an entropic optimal transport (OT) framework into the denoising loop of a compositional 3D latent diffusion model. The OT plan gates cross-attention to enforce exclusive patch-to-part routing (preventing feature entanglement), while edge-regularized assignment costs encourage clean instance separation at image boundaries. The approach achieves state-of-the-art instance-level consistency and geometric fidelity on 74 diverse open-world scene images.
Sharp Monocular View Synthesis in Less Than a Second: SHARP generates approximately 1.2 million 3D Gaussians from a single image via a single feedforward pass, completing inference in under one second on an A100 GPU with rendering speeds exceeding 100 FPS. It achieves state-of-the-art zero-shot generalization across 6 datasets, reducing LPIPS by 25–34% and synthesis time by three orders of magnitude compared to the strongest prior method.
Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction for 3D-Aware Distillation: Within a student-teacher distillation framework, this work augments the teacher with a pretrained feed-forward 3D reconstruction model (MVSplat) that lifts 2D features into a 3D Gaussian representation and renders them to novel viewpoints, enabling the student to learn geometrically consistent, 3D-aware 2D features. The proposed method surpasses existing approaches across downstream tasks including depth estimation, surface normal estimation, semantic segmentation, and multi-view correspondence.
Splat Feature Solver: This paper unifies the feature lifting problem for 3D splat representations as a sparse linear inverse problem $AX=B$, proposes a closed-form solver with a provable $(1+\beta)$-approximation error bound under convex loss, and introduces two regularization strategies—Tikhonov Guidance and Post-Lifting Aggregation—achieving state-of-the-art performance on open-vocabulary 3D segmentation.
Station2Radar: Query-Conditioned Gaussian Splatting for Precipitation Field: This paper proposes Query-Conditioned Gaussian Splatting (QCGS), the first method to introduce 2D Gaussian Splatting into precipitation field generation. By fusing satellite imagery with sparse automatic weather station (AWS) observations, QCGS achieves flexible-resolution precipitation field reconstruction without radar input, reducing RMSE by over 50% compared to conventional gridded products.
StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams: StreamSplat proposes a fully feed-forward online dynamic 3D reconstruction framework that enables instant generation of dynamic 3DGS representations from uncalibrated video streams, achieving 1200× speedup over optimization-based methods through three key innovations: probabilistic position sampling, bidirectional deformation fields, and adaptive Gaussian fusion.
Stroke3D: Lifting 2D Strokes into Rigged 3D Model via Latent Diffusion Models: Stroke3D is the first method to generate rigged 3D mesh models directly from user-drawn 2D strokes and text prompts. It employs a skeleton-first two-stage pipeline: a graph VAE and graph DiT are used to generate controllable 3D skeletons, followed by TextuRig dataset augmentation and SKA-DPO optimization to synthesize high-quality meshes.
Stylos: Multi-View 3D Stylization with Single-Forward Gaussian Splatting: Stylos proposes a single-forward 3D style transfer framework that achieves zero-shot 3D stylization from uncalibrated inputs via a dual-path design with a shared Transformer backbone (geometry self-attention + style cross-attention) and a voxel-level 3D style loss, supporting scalability from single-view to hundreds of views.
SurfSplat: Conquering Feedforward 2D Gaussian Splatting with Surface Continuity Priors: SurfSplat proposes a feedforward 3D reconstruction framework based on 2DGS that binds Gaussian rotation and scale to local neighborhood positions via surface continuity priors, resolves color bias through a forced alpha blending strategy, and introduces the HRRC metric to reveal reconstruction quality discrepancies at high resolutions.
Topology-Preserved Auto-regressive Mesh Generation in the Manner of Weaving Silk: This paper proposes a "silk-weaving"-inspired mesh tokenization algorithm that provides a canonical topological framework through vertex layering and ordering, guaranteeing manifoldness, watertightness, normal consistency, and part-awareness in generated meshes while achieving state-of-the-art compression efficiency.
UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images: This paper proposes UFO-4D, a unified feedforward framework that directly predicts dynamic 3D Gaussian representations from two unposed images, enabling jointly consistent estimation of 3D geometry, 3D motion, and camera pose, achieving up to 3× improvement over existing methods on geometry and motion benchmarks.
Uncertainty Matters in Dynamic Gaussian Splatting for Monocular 4D Reconstruction: This paper proposes USplat4D, an uncertainty-aware dynamic Gaussian splatting framework that estimates per-Gaussian time-varying uncertainty scores and constructs uncertainty-guided spatiotemporal graphs to propagate reliable motion cues, substantially improving monocular 4D reconstruction quality in occluded regions and under extreme novel viewpoints.
Universal Beta Splatting: This paper proposes Universal Beta Splatting (UBS), which generalizes 3D Gaussian Splatting to an N-dimensional anisotropic Beta kernel. By enabling per-dimension shape control, UBS unifies spatial geometry, view-dependent appearance, and scene dynamics within a single representation, achieving interpretable scene decomposition and state-of-the-art rendering quality.
UrbanGS: A Scalable and Efficient Architecture for Geometrically Accurate Large-Scene Reconstruction: This paper proposes UrbanGS, a scalable 3DGS reconstruction framework for urban-scale scenes that simultaneously improves geometric accuracy, rendering quality, and memory efficiency through depth-consistent D-Normal regularization, spatially adaptive Gaussian pruning (SAGP), and a unified partitioning strategy.
Weight Space Representation Learning on Diverse NeRF Architectures: This paper proposes the first representation learning framework capable of processing weights from diverse NeRF architectures (MLP / tri-plane / hash table). By combining a Graph Meta-Network (GMN) encoder with a SigLIP contrastive loss, it constructs an architecture-agnostic latent space, enabling classification, retrieval, and language-grounded tasks across 13 NeRF architectures, with generalization to architectures unseen during training.
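
The Minkowski Representation Hypothesis mentioned in the "Into the Rabbit Hull" entries above describes a token as a Minkowski sum of convex polytopes. The sketch below only illustrates that definition in plain Python; it is not the paper's method. The function names (`convex_mixture`, `minkowski_token`, `minkowski_vertices`) and the toy 2D polytopes are assumptions made for illustration.

```python
from itertools import product


def convex_mixture(vertices, coeffs):
    """Return sum_i coeffs[i] * vertices[i] for coefficients on the simplex."""
    if len(vertices) != len(coeffs):
        raise ValueError("need one coefficient per vertex")
    if any(c < 0 for c in coeffs) or abs(sum(coeffs) - 1.0) > 1e-9:
        raise ValueError("coefficients must be non-negative and sum to 1")
    dim = len(vertices[0])
    return tuple(
        sum(c * v[d] for c, v in zip(coeffs, vertices)) for d in range(dim)
    )


def minkowski_token(polytopes, mixtures):
    """Build one point of the Minkowski sum: add one convex mixture per polytope."""
    points = [convex_mixture(v, c) for v, c in zip(polytopes, mixtures)]
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) for d in range(dim))


def minkowski_vertices(polytopes):
    """All sums of one vertex per polytope; their convex hull is the Minkowski sum."""
    dim = len(polytopes[0][0])
    return {
        tuple(sum(v[d] for v in combo) for d in range(dim))
        for combo in product(*polytopes)
    }


# Toy example (hypothetical data): two line segments in 2D.
segment_a = [(0, 0), (2, 0)]
segment_b = [(0, 0), (0, 4)]
token = minkowski_token(
    [segment_a, segment_b],
    [(0.5, 0.5), (0.25, 0.75)],
)
corners = minkowski_vertices([segment_a, segment_b])
```

In this toy case the two segments sum to an axis-aligned rectangle, and `token` is one point inside it. Under the hypothesis, each polytope would correspond to a group of prototype points, and the mixture weights would say how close the token sits to each prototype.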

📊 LLM Evaluation¶

Accessible, Realistic, and Fair Evaluation of Positive-Unlabeled Learning Algorithms: This paper proposes the first unified benchmark for PU learning and systematically addresses two critical issues: (1) enabling model selection without negative samples via proxy accuracy and proxy AUC; (2) identifying and resolving intra-dataset label shift in the one-sample setting through a simple calibration strategy that merges positive samples into the unlabeled set, enabling fair comparison of two-sample algorithms under one-sample evaluation.
AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning: This paper introduces AnesSuite, the first comprehensive dataset suite for anesthesiology reasoning, comprising AnesBench—an evaluation benchmark of 7,972 bilingual multiple-choice questions organized into three cognitive difficulty levels—and three training datasets (AnesCorpus/AnesQA/AnesR1). The Morpheus models trained on this suite via SFT+GRPO enable a 7B model to match a 14B baseline, while revealing significant bottlenecks of state-of-the-art LLMs on complex clinical reasoning (System 2).
ASIDE: Architectural Separation of Instructions and Data in Language Models: This paper proposes ASIDE, an architectural modification that distinguishes instructions from data at the token embedding level via orthogonal rotation. Requiring only changes to the forward pass and training on standard instruction fine-tuning data, ASIDE significantly improves instruction-data separation and robustness against prompt injection without any dedicated safety training.
AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite: The AI2 team identifies five methodological flaws in existing scientific research agent benchmarks and introduces AstaBench, the first agent evaluation suite covering the full scientific research pipeline. AstaBench comprises 4 categories and 11 sub-benchmarks with 2,400+ questions, a production-grade controllable search tool backed by Semantic Scholar, and 9 research-optimized Asta Agent baselines. It conducts the largest systematic evaluation to date across 57 agents (22 types), finding that despite progress on individual tasks such as literature retrieval, AI remains far from meeting the demands of end-to-end scientific research assistance.
Benchmarking Overton Pluralism in LLMs: This paper proposes the OvertonBench framework, which formalizes Overton pluralism as a set-coverage metric called OvertonScore through a large-scale human study (1,208 demographically representative U.S. participants, 60 subjective questions, 8 LLMs). All evaluated models score only 0.35–0.41 (theoretical maximum: 1.0), and an automated evaluation tool achieving high correlation with human judgments (ρ=0.88) is constructed.
BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation: This paper proposes BiasScope, a fully LLM-driven iterative framework that automatically discovers previously unknown biases in LLM-as-a-Judge evaluation at scale. Based on the discovered biases, the authors construct JudgeBench-Pro, a more challenging benchmark on which even powerful LLM judges exceed a 50% error rate.
Biologically Plausible Online Hebbian Meta-Learning: Two-Timescale Local Rules for Spiking Neural Brain Interfaces: This paper proposes an online SNN decoder that eliminates BPTT by combining three-factor Hebbian local learning rules with dual-timescale eligibility traces and adaptive learning rate control. The approach achieves neural decoding accuracy comparable to offline-trained methods (Pearson R ≥ 0.63/0.81) under O(1) memory complexity, and demonstrates continuous adaptation to non-stationary neural signals in closed-loop simulations.
Breaking the Correlation Plateau: On the Optimization and Capacity Limits of Attention-Based Regressors: This paper provides the first theoretical analysis of the "PCC plateau" phenomenon observed when training attention-based regression models with a joint MSE+PCC objective. The root causes are identified as the conflict between MSE optimization and PCC gradients, together with an expressivity upper bound imposed by the convex aggregation of softmax. The authors propose the ECA (Extrapolative Correlation Attention) framework, which breaks through this limitation via three components: scaled residual aggregation, dispersion-aware temperature softmax, and dispersion-normalized PCC loss.
Can Vision–Language Models Assess Graphic Design Aesthetics? A Benchmark, Evaluation, and Dataset Perspective: This paper proposes AesEval-Bench, the first benchmark for systematically evaluating VLMs on graphic design aesthetics (4 dimensions × 12 indicators × 3 tasks). It finds that existing VLMs—including reasoning-augmented models—perform poorly on design aesthetics, and constructs training data via human-guided VLM labeling combined with indicator-grounded reasoning. Fine-tuning a 7B model with this data surpasses GPT-5 on the precise localization task.
Can You Hear Me Now? A Benchmark for Long-Range Graph Propagation and Beyond: This paper proposes the ECHO benchmark, comprising 3 synthetic tasks and 2 real-world chemistry tasks grounded in density functional theory (DFT), requiring graph neural networks to propagate information effectively over 17–40 hops. The benchmark systematically evaluates the long-range propagation capabilities of 11 GNN architectures.
Conformal Prediction Adaptive to Unknown Subpopulation Shifts: To address the failure of standard conformal prediction under subpopulation shift, this paper proposes three adaptive algorithms: weighting calibration data via a learned domain classifier (Algorithms 1/2) or via embedding similarity (Algorithm 3). Coverage guarantees are maintained even with imperfect or absent domain labels, with applications to visual classification and LLM hallucination detection.
DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science: DARE-bench is a large-scale verifiable benchmark for data science tasks, comprising 6,300 Kaggle-derived tasks that support evaluation across two dimensions—ML modeling and instruction following—along with training data for SFT and RL. SFT improves Qwen3-32B by 1.83×, while RL improves Qwen3-4B by more than 8×.
Deep FlexQP: Accelerated Nonlinear Programming via Deep Unfolding: This paper proposes FlexQP — an "always feasible" convex quadratic programming (QP) solver based on $\ell_1$ elastic relaxation — and combines it with deep unfolding to learn an LSTM feedback policy that accelerates convergence, yielding Deep FlexQP. When embedded as a submodule within an SQP framework, it solves nonlinear trajectory optimization problems 4–16× faster than OSQP, reduces safety violations in predictive safety filters by over 70%, and improves task completion rates by 43%.
Discount Model Search for Quality Diversity Optimization in High-Dimensional Measure Spaces: This paper proposes Discount Model Search (DMS), which replaces the histogram-based discrete representation in CMA-MAE with a neural network that fits a continuous, smooth discount function. This addresses the issue of search stagnation caused by distortion in high-dimensional measure spaces, and enables, for the first time, the direct use of image datasets to define measure spaces (the QDDM paradigm).
Disentangling Shared and Private Neural Dynamics with SPIRE: A Latent Modeling Framework for Deep Brain Stimulation: This paper proposes SPIRE (Shared–Private Inter-Regional Encoder), a nonlinear dual-latent-space autoencoder framework that decomposes intracranial recordings from multiple brain regions into shared and private subspaces via cross-region alignment and orthogonal disentanglement losses. Trained exclusively on baseline data, SPIRE detects frequency-dependent network reorganization induced by deep brain stimulation (DBS).
Do We Really Need Permutations? Impact of Model Width on Linear Mode Connectivity: This paper empirically demonstrates that linear mode connectivity (LMC) between independently trained models can be achieved by simply increasing model width, without any parameter permutation. It further proposes Layer-wise Exponentially Weighted Connectivity (LEWC) to explain the underlying mechanism.
Enabling Fine-Grained Operating Points for Black-Box LLMs: This paper shows that verbalized probabilities from black-box LLMs take only 16–23 unique values (a low-cardinality problem), which yields coarse PR/ROC curves and prevents fine-grained threshold tuning. Injecting parameterized noise, optionally followed by an MLP correction, raises the number of unique values from 16 to 20,000+ and matches the performance of 20-sample ensembles with only 1–2 API calls.
Function Spaces Without Kernels: Learning Compact Hilbert Space Representations: This paper proves that Function Encoders, which learn neural network basis functions, implicitly define a valid kernel, thereby bridging neural feature learning and RKHS theory. It further proposes PCA-guided compact basis selection algorithms and establishes finite-sample generalization bounds.
GuidedSampling: Steering LLMs Towards Diverse Candidate Solutions at Inference-Time: This paper proposes GuidedSampling, an inference-time algorithm that explicitly decouples the implicit exploration and generation process of repeated sampling (RS) into two stages: iteratively generating diverse problem-solving concepts/theorems, followed by generating candidate solutions conditioned on each concept. The method achieves an average improvement of ~21.6% on pass@50 and ~9.7% on pass@5 after fine-tuning.
How Reliable is Language Model Micro-Benchmarking?: This paper proposes Minimum Detectable Ability Difference (MDAD) as a meta-evaluation metric, systematically demonstrating that micro-benchmarks at extremely small scales cannot reliably distinguish model pairs with small performance gaps, and that random sampling becomes competitive with carefully designed micro-benchmark methods once the sample size reaches ~250.
Human-LLM Collaborative Feature Engineering for Tabular Learning: This paper proposes a human-LLM collaborative feature engineering framework that decouples the proposal and selection of feature operations. A Bayesian neural network models operation utility and uncertainty to guide selection, with selective human preference feedback incorporated when appropriate. The framework achieves 8.96%–11.23% average error rate reduction across 18 tabular datasets.
Improving Set Function Approximation with Quasi-Arithmetic Neural Networks: This paper proposes QUANN (Quasi-Arithmetic Neural Networks), which employs invertible neural networks to implement a learnable Kolmogorov mean as the pooling operation. It is the first to realize a machine-learning instantiation of generalized measures of central tendency. QUANN serves as a universal approximator for mean-decomposable set functions, and the learned embeddings exhibit stronger cross-task transferability.
In-Context Learning for Pure Exploration: This paper proposes ICPE (In-Context Pure Exploration), an in-context learning framework that combines supervised learning and reinforcement learning. Using a Transformer trained directly from experience, ICPE learns exploration policies for active sequential hypothesis testing and pure exploration problems, achieving near-optimal instance-adaptive algorithmic performance without explicit modeling of the information structure.
In-Context Learning of Temporal Point Processes with Foundation Inference Models: This paper proposes FIM-PP — the first foundation inference model for marked temporal point processes (MTPP). A Transformer is pretrained on 72K synthetic point processes (14.4M events) to perform in-context inference of conditional intensity functions. In zero-shot settings, FIM-PP matches the performance of specialized models trained for hours; after a few minutes of fine-tuning, it achieves state-of-the-art results on multi-event prediction across four real-world datasets.
LCA: Local Classifier Alignment for Continual Learning: This paper proposes Local Classifier Alignment (LCA), a loss function that simultaneously minimizes classification loss and loss sensitivity within local regions of class prototype Gaussian distributions. LCA addresses the classifier mismatch problem arising from incremental backbone merging in continual learning. Combined with an Incremental Merging (IM) strategy for PEFT modules, the method achieves an overall average accuracy of 85.6% across 7 benchmark datasets, substantially outperforming prior state-of-the-art methods.
LLM Unlearning with LLM Beliefs: This paper reveals that LLM unlearning methods such as GA and NPO suffer from a squeezing effect—reducing the probability of a target response causes probability mass to redistribute toward semantically related high-likelihood regions, resulting in spurious unlearning. The authors propose a bootstrapping-based framework that leverages the model's own high-confidence predictions (model beliefs) as additional unlearning targets. Two instantiations, BS-T (token-level) and BS-S (sequence-level), achieve more thorough unlearning while preserving model utility across multiple benchmarks including TOFU, MUSE, and WMDP.
Measuring Uncertainty Calibration: For the problem of estimating the $L_1$ calibration error of binary classifiers from finite samples, this paper proposes the first non-asymptotic, distribution-free methods for certifying an upper bound under two structural assumptions (bounded variation and bounded derivatives), where the latter can be guaranteed by applying a small perturbation to classifier outputs. Experiments demonstrate that the calibration error upper bound can be controlled to approximately 0.02 with $10^7$ samples.
Mitigating Spurious Correlation via Distributionally Robust Learning with Hierarchical Ambiguity Sets: This paper proposes a hierarchical DRO framework that simultaneously captures inter-group uncertainty (shifts in group proportions) and intra-group uncertainty (distributional shifts within each group). By defining intra-group ambiguity sets in the semantic space via the $W_\infty$ distance, the method achieves state-of-the-art performance on standard benchmarks and maintains strong robustness under a newly designed minority group distributional shift setting where all competing methods fail.
MOSIV: Multi-Object System Identification from Videos: This paper proposes MOSIV—the first complete framework for multi-object system identification from multi-view videos—comprising three stages: (1) object-aware 4D dynamic Gaussian reconstruction of per-object geometry and motion; (2) Gaussian-to-continuum lifting to construct MPM simulation particles; and (3) differentiable MPM forward rollout with geometry-alignment objectives (3D Chamfer + 2D silhouette) to back-propagate and optimize per-object continuous material parameters ($E, \nu, \mu$). On a contact-rich synthetic benchmark spanning four material types (elastic, elastoplastic, fluid, and granular), MOSIV achieves PSNR 30.51 vs. OmniPhysGS 25.93 and reduces Chamfer distance by 9.4×, establishing a new baseline for multi-object long-horizon physical simulation.
Multi-LLM Adaptive Conformal Inference for Reliable LLM Responses: This paper proposes MACI (Multi-LLM Adaptive Conformal Inference), which combines a cumulative-product conformity score, a multi-LLM ensemble for factuality scoring, and group-conditional calibration to significantly improve the retention rate of factual claims in LLM responses while strictly guaranteeing user-specified error rates.
Noise-Aware Generalization: Robustness to In-Domain Noise and Out-of-Domain Generalization: This paper is the first to formally define the Noise-Aware Generalization (NAG) problem — simultaneously pursuing in-domain robustness and out-of-domain generalization under label noise — and proposes DL4ND, a method that detects noisy labels via cross-domain comparison, achieving up to 12.5% improvement across 7 datasets.
Non-Clashing Teaching in Graphs: Algorithms, Complexity, and Bounds: This paper studies non-clashing teaching of closed-neighborhood concept classes in graphs, providing tight algorithmic bounds (a matching $2^{\mathcal{O}(|E|)}$ bound for N-NCTD⁺), FPT algorithms parameterized by treedepth and vertex cover (including the first FPT result with negative labels), and combinatorial upper bounds for planar graphs and unit square graphs, substantially advancing both the computational and combinatorial understanding of non-clashing teaching.
Optimal Transport-Induced Samples against Out-of-Distribution Overconfidence: This paper leverages the geometric singularity boundaries of semi-discrete optimal transport (OT) to locate semantically ambiguous regions in latent space, generates proxy OOD samples (OTIS) near these boundaries, and applies a confidence suppression loss during training to enforce uniform predictions in structurally uncertain regions, thereby systematically mitigating OOD overconfidence in DNNs.
PlanetAlign: A Comprehensive Python Library for Benchmarking Network Alignment: This paper presents PlanetAlign, a PyTorch-based network alignment benchmark library integrating 18 datasets across 6 domains, 14 methods spanning three categories (consistency-based, embedding-based, and optimal transport-based), and a standardized evaluation pipeline. Through large-scale systematic experiments, PlanetAlign reveals that OT-based methods (PARROT/JOENA) are consistently the most effective, while different method categories exhibit distinct trade-offs in scalability and robustness.
Predicting LLM Reasoning Performance with Small Proxy Model: This paper proposes rBridge, which uses reasoning traces from frontier models as gold labels and applies token-level task-aligned weighted NLL, enabling small models (≤1B) to effectively predict the reasoning performance of 13B–32B models, achieving over 100× computational savings on dataset ranking tasks.
Preference Leakage: A Contamination Problem in LLM-as-a-judge: This paper is the first to formally define and systematically investigate Preference Leakage in LLM-as-a-Judge — when the synthetic data generator $M_G$ and the judge $M_J$ are related (same model / inheritance / same family), the judge exhibits systematic preference toward the "associated student model." Under the same-model scenario, PLS reaches 28.7% on Arena-Hard, and this bias is more subtle and harder to detect than egocentric bias.
Prompt and Parameter Co-Optimization for Large Language Model Task Adaptation: This paper proposes MetaTuner, a framework that employs a shared meta-encoder to simultaneously generate query-specific prompts and LoRA parameters, enabling mutual reinforcement between prompt optimization and fine-tuning. A supervised regularization loss is designed to address the mixed discrete-continuous optimization problem. MetaTuner consistently outperforms standalone prompt optimization and fine-tuning methods on MATH, GSM8K, HotpotQA, and CosmosQA.
Prompt and Parameter Co-Optimization for Large Language Models: This paper proposes MetaTuner, a framework that simultaneously generates prompts and LoRA parameters via a shared meta encoder, unifying discrete prompt optimization and continuous parameter fine-tuning into an end-to-end jointly optimizable framework, achieving substantial improvements over independently optimized methods on mathematical reasoning and question answering tasks.
RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty: This paper proposes RankLLM, a non-parametric framework based on bidirectional score propagation over a directed bipartite graph, which jointly estimates question difficulty and model competency to achieve difficulty-aware LLM ranking, reaching 90% agreement with human judgments.
Rethinking Benign Relearning: Syntax as the Hidden Driver of Unlearning Failures: This paper identifies that the true driver of "benign relearning" in LLM machine unlearning is not topical relevance but syntactic similarity, and proposes a syntactic diversification strategy to improve unlearning robustness.
Rethinking Benign Relearning: Syntax as the Hidden Driver of Unlearning Failures: This paper reveals that the true driver of "benign relearning" in LLM machine unlearning is syntactic similarity rather than topical relevance, and proposes a syntactic diversification strategy (paraphrasing the forget set) that effectively suppresses relearning, accelerates forgetting, and alleviates the trade-off between unlearning efficacy and model utility.
Revisiting the Past: Data Unlearning with Model State History: This paper proposes MSA (Model State Arithmetic), an algorithm that leverages intermediate training checkpoints to construct "forgetting vectors" and removes the influence of specific data via parameter-space arithmetic. MSA consistently outperforms existing unlearning methods such as NPO, RMU, and GradDiff on the TOFU and RESTOR benchmarks, while maintaining model utility even without a retain set.
Same Content, Different Representations: A Controlled Study for Table QA: This paper presents the first controlled study that systematically evaluates the robustness of NL2SQL, LLM, and hybrid approaches under varying table size, schema quality, and query complexity. Only the representation format (structured vs. semi-structured) is changed while table content is held constant, demonstrating that representation format is a first-order factor in Table QA performance.
SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs: SimpleToM exposes a critical gap in LLMs' Theory of Mind capabilities: frontier models can accurately infer others' mental states (explicit ToM), but performance drops sharply when this knowledge must be applied to behavior prediction and behavior judgment (applied ToM), revealing a substantial divide between "knowing what" and "knowing how to use what is known."
SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home Agents: SimuHome combines a high-fidelity smart home simulator built on the Matter protocol with a 600-episode evaluation benchmark. It supports dynamic environmental variable updates and time-accelerated scheduling evaluation, and reveals that workflow scheduling remains the most persistent challenge for current LLM agents.
Soft Quality-Diversity Optimization: This paper proposes the Soft QD Score as a novel quality-diversity optimization objective that eliminates the need for behavior space discretization, and derives a differentiable algorithm, SQUAD, which scales more effectively to high-dimensional behavior spaces while achieving competitive performance on standard benchmarks.
Spectral Attention Steering for Prompt Highlighting: This paper proposes SEKA/AdaSEKA, which learns a "relevance subspace" via spectral decomposition of key embeddings and directly edits key vectors prior to attention computation to achieve prompt highlighting. The approach requires no storage of the full attention matrix, is fully compatible with FlashAttention, and incurs negligible overhead (+0.03s/sample).
Subliminal Signals in Preference Labels: This paper demonstrates that preference labels can serve as a covert communication channel: even when a student model generates semantically irrelevant numeric sequences, a biased judge model can transmit subliminal behavioral tendencies to the student model through binary preference labels alone, and this transmission is amplified under iterative alignment.
TabStruct: Measuring Structural Fidelity of Tabular Data: This paper proposes the TabStruct evaluation framework and a global utility metric that measures the structural fidelity of tabular data generators with respect to causal structure, without requiring ground-truth causal graphs. A systematic comparison of 13 generators across 29 datasets reveals that diffusion models significantly outperform other methods in preserving global structure.
Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis: This paper proposes TED (Talk, Evaluate, Diagnose), a framework that achieves user-aware dynamic agent evaluation via general, reusable expert/non-expert persona templates; enables fine-grained efficiency assessment through grading notes, LLM-as-judge scoring, and novel metrics such as MaxProgressRate@k; and provides actionable improvement feedback via automated error discovery and clustering. Experiments on τ²-bench and ToolSandbox reveal new insights into agent performance.
Towards Anomaly-Aware Pre-Training and Fine-Tuning for Graph Anomaly Detection: This paper proposes the APF framework, which addresses the dual challenges of label scarcity and homophily disparity in graph anomaly detection through Rayleigh quotient-guided anomaly-aware pre-training and granularity-adaptive fine-tuning.
Truthfulness Despite Weak Supervision: Evaluating and Training LLMs Using Peer Prediction: This paper proposes applying the Peer Prediction mechanism from game theory to LLM evaluation and training. By measuring the mutual predictability of participants' answers, the method distinguishes honest from deceptive responses without requiring ground-truth labels, thereby incentivizing truthfulness. It exhibits a striking inverse scaling property — weaker experts are actually more resistant to deception by stronger models.
UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking: This paper identifies and formalizes the problem of Unindexed Information Seeking (UIS)—dynamic web pages, embedded files, and interactive content that cannot be directly retrieved by search engines—and proposes the first UIS benchmark UIS-QA (110 questions) along with the multi-agent framework UIS-Digger. A ~30B parameter model trained with SFT+RFT achieves 27.27% accuracy, surpassing systems integrating O3/GPT-4.1.
Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework: This paper proposes the HUMAINE framework, which conducts multi-dimensional (5-axis), multi-turn human preference evaluations of 28 SOTA models using 23,404 demographically stratified participants. A hierarchical Bayesian BTD model reveals that age is the largest driver of preference heterogeneity (mean rank shift ±2.8), demonstrating that a single aggregated leaderboard is insufficient to reflect the true preferences of diverse populations.
Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework: This paper proposes the HUMAINE framework, which conducts multi-dimensional evaluations of 28 models with 23,404 demographically stratified participants, revealing that age is the greatest axis of divergence in human preference and that a single leaderboard obscures critical differences.
vCache: Verified Semantic Prompt Caching: This paper proposes vCache — the first semantic caching system with user-defined error-rate guarantees — which employs online learning to independently estimate the optimal similarity threshold for each cached embedding. Without any pre-training, vCache achieves up to a 12.5× improvement in cache hit rate and a 26× reduction in error rate while satisfying correctness constraints.
When Priors Backfire: On the Vulnerability of Unlearnable Examples to Pretraining: This paper reveals a fundamental vulnerability of Unlearnable Examples (UE) against pretrained models—pretraining priors enable models to bypass the spurious shortcuts injected by UE and recover learning of true semantics—and proposes BAIT, a bilevel optimization framework that counters pretraining priors by binding perturbations to incorrect target labels.
When Priors Backfire: On the Vulnerability of Unlearnable Examples to Pretraining: This paper exposes a fundamental vulnerability of Unlearnable Examples (UEs) against pretrained models — pretraining priors enable models to bypass perturbation shortcuts and learn true semantics — and proposes the BAIT framework, which counters pretraining priors by binding perturbations to incorrect target labels.
When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling: This paper proposes SAFE (Stable And Fast LLM Ensembling), which selectively ensembles multiple heterogeneous-tokenizer LLMs at the token level via a Generate-Verify-Ensemble loop. SAFE addresses OOV-like contamination caused by tokenization mismatch in long-sequence generation, achieving performance gains by ensembling on fewer than 1% of tokens—improving UniTE from 59.6% to 77.4% on MATH500.
Which LLM Multi-Agent Protocol to Choose?: This paper introduces ProtocolBench and ProtocolRouter, presenting the first systematic comparison of multi-agent communication protocols (A2A, ACP, ANP, Agora, etc.) across four dimensions—task success rate, latency, message overhead, and robustness—and proposes a learnable protocol router for scenario-adaptive protocol selection, reducing fault recovery time by up to 18.1%.
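
To make the $L_1$ calibration error from the Measuring Uncertainty Calibration note above concrete, here is a minimal sketch of the standard binned plug-in estimate for a binary classifier. It is not the paper's certified upper bound; the function name `binned_l1_calibration_error` and the equal-width binning with `n_bins=10` are assumptions chosen for illustration.

```python
import numpy as np

def binned_l1_calibration_error(probs, labels, n_bins=10):
    """Plug-in estimate of the L1 calibration error using equal-width bins.

    probs  : predicted probabilities of the positive class, in [0, 1]
    labels : binary ground-truth labels (0 or 1)
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins over [0, 1];
    # a probability of exactly 1.0 is placed in the last bin.
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    error = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        # Gap between mean confidence and empirical positive rate in the bin,
        # weighted by the fraction of samples that fall into the bin.
        gap = abs(probs[mask].mean() - labels[mask].mean())
        error += mask.mean() * gap
    return error

# Toy usage: slightly under-confident predictions on four samples.
estimate = binned_l1_calibration_error([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
```

A binned estimate like this depends on the number of bins and gives no finite-sample guarantee, which is the gap the certified bounds in that note are meant to address.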

🔬 Interpretability¶

A Cortically Inspired Architecture for Modular Perceptual AI: This paper proposes a neuroscience-grounded, cortically inspired blueprint for modular perceptual AI comprising four components: dedicated encoders, a shared cross-modal latent space, a routing controller, and a recursive predictive feedback loop. Sparse autoencoder experiments validate that modular decomposition improves within-domain feature stability (+15.4pp Jaccard overlap).
ActivationReasoning: Logical Reasoning in Latent Activation Spaces: This paper proposes the ActivationReasoning (AR) framework, which embeds explicit logical reasoning into the latent activation space of LLMs (via SAE-extracted features) through a three-stage pipeline: discovering concept representations → detecting activated propositions → reasoning with logical rules. The framework supports multi-hop reasoning, concept composition, and safety control, achieving 95%+ accuracy on PrOntoQA with an 8B model, surpassing GPT-4o.
Auditing Cascading Risks in Multi-Agent Systems via Semantic–Geometric Co-evolution: This paper proposes SCCAL, a framework that models semantic–geometric co-evolution in multi-agent systems (MAS) by coupling semantic flow with the Ollivier–Ricci curvature (ORC) of interaction graphs. The joint prediction residual between the two modalities serves as an early warning signal for cascading risks, enabling anomaly detection several rounds before semantic violations become observable.
Behavior Learning (BL): Learning Hierarchical Optimization Structures from Data: Inspired by the utility maximization paradigm in behavioral science, this paper proposes the Behavior Learning (BL) framework, which models data as a Gibbs distribution induced by a hierarchical composition of interpretable, modular utility maximization problems (UMPs), achieving a unified balance among predictive performance, intrinsic interpretability, and parameter identifiability.
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models: This paper proposes the Truncated Polynomial Classifier (TPC), which enables dynamic safety monitoring by training a polynomial over LLM activation spaces order-by-order and evaluating via truncation at inference time. Low-order truncations (≈ linear probes) handle easy inputs quickly, while higher-order terms provide stronger protection for difficult inputs. TPC matches or outperforms MLP baselines on WildGuardMix and BeaverTails while offering built-in interpretability.
Closing the Curvature Gap: Full Transformer Hessians and Their Implications for Scaling Laws: This paper presents the first explicit Hessian expressions and spectral norm upper bounds for a complete Transformer block (including LayerNorm and FFN), and establishes a theoretical framework showing that the loss landscape converges at an $O(1/k)$ rate as data volume increases, providing a mathematical foundation for scaling laws and curvature-aware training.
Concepts' Information Bottleneck Models: This paper introduces Information Bottleneck (IB) regularization into the concept layer of Concept Bottleneck Models (CBMs), learning minimal sufficient concept representations by penalizing $I(X;C)$ while preserving $I(C;Y)$. The approach consistently improves both predictive performance and concept intervention reliability across six CBM variants and three benchmarks.
Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings: This paper proposes the Iso-Energy hypothesis — that concepts genuinely shared across modalities should exhibit equal average activation energy in each modality — and introduces Aligned SAE as an analytical tool to reveal the geometric structure of VLM embedding spaces, where bimodal atoms carry cross-modal alignment signals and unimodal atoms fully account for the modality gap.
Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning: This paper proposes NDM (Neighbor Distance Minimization), an unsupervised method that discovers interpretable, non-basis-aligned subspaces in neural network representation spaces by minimizing intra-subspace neighbor distances. On GPT-2, it achieves an average Gini coefficient of 0.71 (indicating highly concentrated information); on Qwen2.5-1.5B, it identifies separated subspaces for routing parametric knowledge versus in-context knowledge.
Decoupling Dynamical Richness from Representation Learning: Towards Practical Measurement: This paper proposes a computationally efficient, performance-agnostic measure of dynamical richness, $\mathcal{D}_{LR}$, which quantifies rich/lazy training dynamics by comparing activations before and after the last layer, and demonstrates that neural collapse is a special case of this measure.
Dynamic Reflections: Probing Video Representations with Text Alignment: This paper is the first to extend the Platonic Representation Hypothesis (PRH) from static image–text to the temporal video–text domain. Through systematic evaluation of 121 visual and language models, it reveals that increasing the number of frames and captions at test time can nearly double alignment scores, and proposes a saturating scaling law with $R^2 > 0.98$ to quantify this behavior.
Dynamic Reflections: Probing Video Representations with Text-Driven Reasoning: This work is the first to extend the Platonic Representation Hypothesis (PRH) to the temporal domain, systematically studying video–text representation alignment. It finds that increasing the number of frames and captions at test time can substantially improve alignment scores (up to doubling), and proposes a precise parameterized test-time scaling law.
Evolution of Concepts in Language Model Pre-Training: This paper is the first to apply crosscoders (cross-snapshot sparse dictionary learning) to track the emergence and evolution of features during language model pre-training. It identifies a two-phase transition from "statistical learning → feature learning" and causally links micro-level feature evolution to macro-level downstream task metrics through attribution analysis.
Exploring Interpretability for Visual Prompt Tuning with Cross-layer Concepts: This paper proposes IVPT (Interpretable Visual Prompt Tuning), which associates abstract visual prompts with human-understandable semantic regions via cross-layer class-agnostic concept prototypes. IVPT is the first method to achieve interpretability for visual prompts while preserving the advantages of parameter-efficient fine-tuning, simultaneously improving explanation consistency (+8.4%) and classification accuracy on fine-grained benchmarks such as CUB-200.
ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection: Inspired by the training pipeline of human content moderators, this paper proposes ExPO-HM, which combines policy manual SFT warm-up, GRPO curriculum learning, and a Conditional Decision Entropy (CDE) reward. It is the first Explain-then-Detect system to comprehensively surpass direct detection baselines across binary classification, fine-grained classification, and reasoning quality in hateful meme detection, achieving up to 15–17% F1 improvement.
Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees: This paper introduces neural network (NN) verification into mechanistic interpretability, proposing the first circuit discovery framework with provable guarantees: input robustness guarantees circuit faithfulness over continuous input domains, patching robustness guarantees circuit consistency over continuous patching domains, and a four-level minimality hierarchy (quasi → local → subset → cardinal) is formalized. A monotonicity theory unifies all three types of guarantees.
GAVEL: Towards Rule-Based Safety through Activation Monitoring: Inspired by the Snort/YARA ruleset paradigm from cybersecurity, this paper proposes decomposing LLM internal activations into 23 fine-grained "Cognitive Elements" (CEs), which are then composed via Boolean logic into auditable safety rules. On Mistral-7B, the approach achieves an average AUC of 0.99 and FPR of 0.004 across 9 misuse categories with less than 1% inference overhead, while naturally supporting cross-lingual and cross-model transfer.
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning: This paper proposes GEPA (Genetic-Pareto), a prompt optimizer that diagnoses failure modes from a small number of execution trajectories via natural language reflection and iteratively refines prompts. GEPA outperforms GRPO by an average of 6% (up to 20%) across six tasks while using only 1/35 of the sampling budget.
Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test: This work is the first to validate the grokking phenomenon in near-single-epoch pretraining of a real-scale LLM (7B MoE)—different data groups exhibit asynchronous memorization and delayed generalization. By analyzing the evolution of MoE routing pathways (from instance-specific to structured/shared), two zero-cost metrics are proposed to monitor generalization progress without requiring instruction tuning or benchmark evaluation.
Hallucination Begins Where Saliency Drops: This paper proposes LVLMs-Saliency, a gradient-aware diagnostic framework that quantifies the visual grounding strength of each output token. It identifies a key finding: hallucinations arise when the saliency of previously generated tokens toward the next token prediction drops. Building on this insight, the paper introduces a dual-mechanism inference-time framework combining SGRS (Saliency-Guided Rejection Sampling) and LocoRE (Local Coherence Reinforcement), achieving significant hallucination reduction across multiple LVLMs.
Hidden Breakthroughs in Language Model Training: This paper proposes POLCA (Projection Oriented Loss Change Allocation)—a method that decomposes per-sample loss changes along any orthogonal basis within a low-rank training subspace—to reveal numerous hidden conceptual breakthroughs from seemingly smooth training loss curves. The approach inverts the paradigm of training interpretability from "define skills first, then observe" to "decompose first, then discover skills automatically."
How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Understanding: By analyzing the leading terms of training gradients, this paper derives closed-form expressions for each Transformer weight matrix during the early training phase. Each matrix decomposes into a simple combination of three basis functions (bigram, token-interchangeability, and context mapping), revealing how Transformers learn semantic associations such as "bird"↔"flew" from natural language data. The theoretical predictions align closely with the weights learned by real LLMs.
Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context: From a statistical decision theory perspective, this paper proves that Transformers can approximate the sufficient statistic of the Bayes-optimal likelihood-ratio test during in-context learning, and through mechanistic analysis reveals that models employ adaptive circuits of different depths for linear versus nonlinear tasks.
Information Shapes Koopman Representation: This paper revisits the problem of finite-dimensional Koopman operator representation learning from the perspective of the Information Bottleneck (IB) framework. The Koopman operator lifts nonlinear dynamical systems into infinite-dimensional linear evolution, yet practical applications require approximation within finite-dimensional subspaces, giving rise to a fundamental tension between compactness and expressiveness. The authors prove that (1) latent mutual information controls an upper bound on prediction error, but excessive maximization leads to mode collapse; and (2) von Neumann entropy prevents collapse and preserves effective dimensionality. Building on these results, an information-theoretic Lagrangian formulation is proposed that jointly balances three objectives—temporal coherence, predictive sufficiency, and structural consistency—and yields a tractable loss function. The method outperforms existing Koopman approaches on three categories of tasks: physics simulation, visual control, and graph-structured dynamics.
Initialization Schemes for Kolmogorov-Arnold Networks: An Empirical Study: This work presents the first systematic study of initialization strategies for spline-based KANs. It proposes variance-preserving schemes inspired by LeCun/Glorot and a tunable power-law initialization family. Large-scale experiments spanning 126K+ model instances demonstrate that power-law initialization consistently outperforms baselines on function fitting and PDE solving, while the Glorot scheme yields significant gains for larger models. NTK eigenspectrum analysis further reveals the underlying optimization dynamics.
Internal Planning in Language Models: Characterizing Horizon and Branch Awareness: This paper proposes an information-theoretic framework based on VQ-VAE to analyze internal planning behavior in language models, finding that planning horizon is task-dependent, that models implicitly retain information about unchosen correct paths, and that next-token decisions rely primarily on the most recent computations.
Layer by layer, module by module: Choose both for optimal OOD probing of ViT: This work systematically investigates the intermediate-layer behavior of pretrained ViTs through large-scale linear probing experiments. It finds that distribution shift is the primary cause of performance degradation in deeper layers, and reveals at the module level that the optimal probing point depends on the degree of shift: probing FFN activations is optimal under significant shift, while probing MHSA post-normalization outputs is optimal under mild shift.
LORE: Jointly Learning the Intrinsic Dimensionality and Relative Similarity Structure from Ordinal Data: This paper proposes LORE — the first framework to jointly learn embeddings and intrinsic dimensionality from ordinal triplet comparisons. It replaces the conventional fixed-dimension strategy with a non-convex Schatten-p quasi-norm regularizer ($p < 1$), solved via an iteratively reweighted nuclear norm (IRNN) algorithm with guaranteed convergence to a stationary point. Evaluated on synthetic data, LLM-simulated perceptual experiments, and three crowdsourced datasets, LORE substantially outperforms all baselines in dimensionality recovery while maintaining high triplet accuracy and semantic interpretability.
MATA: A Trainable Hierarchical Automaton System for Multi-Agent Visual Reasoning: This paper proposes MATA (Multi-Agent hierarchical Trainable Automaton), which formulates multi-agent visual reasoning as a hierarchical finite-state automaton. The top-level state transitions are learned by a trainable hyper agent (an LLM-based state controller), while each agent internally employs a rule-based sub-automaton. Collaboration and competition are realized through shared memory. MATA achieves state-of-the-art performance on multiple visual reasoning benchmarks.
Modal Logical Neural Networks for Financial AI: This paper proposes Modal Logical Neural Networks (MLNN), which integrate Kripke semantics (necessity/possibility modal operators) into neural networks, achieving auditable logical reasoning combined with deep learning performance for financial contract safety review, wash-sale compliance, and market collusion detection.
Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences: This paper demonstrates that narrow finetuning leaves clearly readable traces in LLM activations: even over the first few tokens of unrelated text, the activation differences between pre- and post-finetuning models encode rich semantic information about the finetuning objective. Using the proposed Activation Difference Lens (ADL) method, an interpretability agent achieves a 91% success rate in identifying finetuning objectives, more than twice the performance of black-box baselines.
NIMO: a Nonlinear Interpretable MOdel: This paper proposes NIMO, a hybrid model $y = \sum_j x_j \beta_j (1 + g_{\mathbf{u}_j}(\mathbf{x}_{-j}))$ that preserves the global interpretability of linear regression coefficients (via mean marginal effects, MEM) while leveraging neural networks to provide instance-wise nonlinear corrections. Linear coefficients and network parameters are jointly optimized efficiently through parameter elimination.
Noise Stability of Transformer Models: This paper proposes noise stability as a superior alternative to average sensitivity for measuring simplicity bias in Transformers, and designs a regularization method based on this metric that accelerates training by approximately 35% on synthetic tasks and 75% on language modeling.
One Language, Two Scripts: Probing Script-Invariance in LLM Concept Representations: Using the Serbian digraphic system (Latin/Cyrillic) as a natural controlled experiment, this paper investigates whether features learned by Sparse Autoencoders (SAEs) capture abstract semantics beyond surface-level tokenization. The study finds that identical sentences across scripts activate highly overlapping SAE features (Jaccard ≈ 0.58), that script switching induces smaller representational differences than same-script paraphrasing, and that this invariance strengthens with model scale — demonstrating that SAE features genuinely capture semantic structure beyond orthography.
PolySHAP: Extending KernelSHAP with Interaction-Informed Polynomial Regression: This paper proposes PolySHAP, which extends KernelSHAP's linear approximation to higher-order polynomial regression to capture nonlinear feature interactions, thereby improving the estimation accuracy of Shapley values. The paper further provides a theoretical proof that paired sampling is equivalent to second-order PolySHAP, offering the first rigorous explanation for the superior performance of this widely used heuristic.
PoSh: Using Scene Graphs to Guide LLMs-as-a-Judge for Detailed Image Descriptions: This paper proposes PoSh, an evaluation metric that extracts scene graphs $G(d) = \langle O(d), E(d), K(d) \rangle$ from both generated and reference descriptions as structured rubrics, guiding an open-source 14B LLM (Qwen3-14B) to perform QA-based fine-grained error localization. PoSh surpasses GPT-4o-as-Judge by 0.05 in Spearman's ρ on the DOCENT artwork benchmark and CapArena, while remaining fully reproducible.
Provably Explaining Neural Additive Models: This paper proposes a dedicated efficient explanation algorithm for Neural Additive Models (NAMs) that generates provably cardinally-minimal explanations using only a logarithmic number of verification queries, outperforming existing general-purpose subset-minimal explanation algorithms in both speed and explanation quality.
RADAR: Reasoning-Ability and Difficulty-Aware Routing for Reasoning LLMs: This paper proposes RADAR, a framework that formulates adaptive inference for reasoning language models (RLMs) as a multi-objective optimization problem. It leverages Item Response Theory (IRT) to jointly estimate interpretable query difficulty and model configuration ability parameters, enabling lightweight and scalable query-level routing. RADAR outperforms state-of-the-art routing methods on 8 reasoning benchmarks while adding only approximately 7ms of latency.
SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks: This paper proposes SALVE, a three-stage "Discover–Verify–Control" framework: (1) an L1-regularized sparse autoencoder (SAE) is trained to discover interpretable feature bases within a model; (2) Grad-FAM visualization is employed to verify the semantic meaning of discovered features; (3) the SAE decoder matrix guides permanent weight-space editing. The framework is validated on ResNet-18 and ViT-B/16, demonstrating precise, persistent, and low-side-effect control ranging from class suppression to cross-class feature modulation.
SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing: This paper proposes SEED-SET, a framework that formulates ethical evaluation of autonomous systems as a hierarchical Bayesian experimental design problem, jointly integrating objective metrics and subjective value judgments to efficiently generate test cases with high ethical alignment under limited evaluation budgets.
Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language: This paper proposes semantic regexes—a structured language for automatically describing LLM features—using three primitives (symbol/lexeme/field) and three modifier types (context/composition/quantification). The approach achieves accuracy on par with natural language descriptions while producing more concise and consistent feature descriptions, and enables quantitative analysis of how feature complexity evolves across layers.
Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language: This paper proposes Semantic Regexes, a structured language for automatically describing LLM features. By combining primitives (symbol / lexeme / field) with modifiers (context / composition / quantification), it produces feature descriptions that are as accurate as natural-language descriptions, yet more concise, more consistent, and more amenable to programmatic analysis.
Stretching Beyond the Obvious: A Gradient-Free Framework to Unveil the Hidden Landscape of Visual Invariance: This paper proposes the Stretch-and-Squeeze (SnS) algorithm, a gradient-free, model-agnostic bi-objective optimization framework that systematically probes the invariance manifold of visual systems by "stretching" representations at different processing levels while "squeezing" the activation of target units. SnS reveals hierarchical differences in invariance interpretability between standard and robust CNNs.
STRIDE: Subset-Free Functional Decomposition for XAI in Tabular Settings: STRIDE reformulates model explanation as an orthogonal functional decomposition problem in RKHS. By recursively centering kernel functions, it analytically computes orthogonal functional components $f_S(x_S)$ without enumerating $2^d$ subsets. The method not only produces scalar importance scores but also reveals how features synergistically or redundantly influence predictions, achieving a 3× speedup over TreeSHAP with $R^2 = 0.93$ on tabular data.
Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability: This paper proposes Temporal SAEs (T-SAEs), which introduce a temporal contrastive loss to encourage high-level features to maintain consistent activations across adjacent tokens. Through self-supervised training without explicit semantic supervision, T-SAEs achieve disentanglement of semantic and syntactic features, recovering smoother and more coherent semantic concepts without sacrificing reconstruction quality.
The Geometry of Reasoning: Flowing Logics in Representation Space: This paper proposes a geometric framework that models the reasoning process of LLMs as "flows" (embedding trajectories) in representation space. Through controlled experiments that decouple logical structure from semantic content, it demonstrates that LLMs internalize logical invariants beyond surface form, and identifies potentially universal representation regularities across model families.
Position: The Reasoning Trap — Logical Reasoning as a Mechanistic Pathway to Advanced AI Self-Awareness: This paper proposes the RAISE framework, arguing that improvements in logical reasoning capabilities (deductive, inductive, and abductive) constitute a mechanistic pathway to AI situational awareness, and that advances in reasoning inevitably amplify the dangerous preconditions for situational awareness.
The Reasoning Trap — Logical Reasoning as a Mechanistic Pathway to Situational Awareness: A position paper proposing the RAISE (Reasoning Advancing Into Self Examination) framework, which systematically argues that three improvement pathways for logical reasoning (deductive/inductive/abductive) will inevitably endow LLMs with situational awareness. The paper constructs a five-level escalation ladder from basic self-identification to strategic deception, and demonstrates that current safety mechanisms such as RLHF and Constitutional AI are insufficient to arrest this trend.
There Was Never a Bottleneck in Concept Bottleneck Models: This paper identifies that Concept Bottleneck Models (CBMs) do not enforce a true "bottleneck" — the fact that a representation variable $z_j$ can predict concept $c_j$ does not imply it encodes only the information of $c_j$. The paper proposes MCBM (Minimal Concept Bottleneck Model), which applies information bottleneck regularization to constrain each $z_j$ to retain only the information of its corresponding concept, thereby achieving genuinely disentangled representations and reliable concept interventions.
Tokenizing Single-Channel EEG with Time-Frequency Motif Learning: This paper proposes TFM-Tokenizer, the first framework that learns a time-frequency motif vocabulary from single-channel EEG and encodes it into discrete tokens. It consistently improves performance on tasks such as event classification and seizure detection, and can serve as a plug-and-play component to enhance existing EEG foundation models.
TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching: This paper proposes TokenSeek, a general-purpose memory optimization plugin for Transformer fine-tuning. By combining contextual attention information with gradient information for instance-level token importance estimation, TokenSeek retains only the top 10% high-value tokens for gradient updates, achieving up to 65.7% memory savings while matching or surpassing full-token fine-tuning performance.
Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer: Through controlled experiments and mechanistic analysis, this paper reveals the nature of subliminal learning: hidden preferences of teacher models are transferred to student models via a small number of "divergence tokens," with early layers playing a critical role. The phenomenon is also shown to be fragile and can be suppressed by simple paraphrasing.
Uncovering Grounding IDs: How External Cues Shape Multimodal Binding: This paper employs mechanistic interpretability tools to reveal the internal mechanism by which external visual cues (symbols + dividing lines) improve reasoning in LVLMs. Under structured inputs, the model spontaneously produces "Grounding IDs," latent identifiers that bind visual regions to symbolic anchors. Causal activation swap experiments (swap accuracy = 0.98) demonstrate that this binding causally drives model predictions. Furthermore, the mechanism reduces Qwen2.5-VL's CHAIR$_S$ hallucination rate from 32.4% to 27.2% on MS-COCO, and generalizes to closed-source models such as GPT-4o.
Uni-NTFM: A Unified Foundation Model for EEG Signal Representation Learning: Uni-NTFM is grounded in first-principles neuroscience. It introduces a Heterogeneous Feature Projection Module (HFPM) for decoupled time-frequency encoding, a hierarchical Topological Embedding (TE) for unifying heterogeneous electrode configurations, and an MoE Transformer for functional modularity and sparse coding. A 1.9B-parameter model is pretrained on approximately 28,000 hours of EEG data, achieving state-of-the-art performance on 9 downstream tasks under both linear probing and fine-tuning protocols.
Universal Properties of Activation Sparsity in Modern Large Language Models: This paper presents a systematic study of activation sparsity in modern LLMs (GLU architecture + SiLU/GELU), proposes a universal top-p sparsification framework and a critical sparsity metric, demonstrates that activation sparsity increases monotonically with model scale, identifies input sparsification as the most practical training-free acceleration scheme, and provides the first empirical evidence that diffusion-based LLMs also exhibit significant activation sparsity.
VCWorld: A Biological World Model for Virtual Cell Simulation: This paper proposes VCWorld, a cell-level white-box simulator that integrates structured biological knowledge graphs with the iterative reasoning capabilities of large language models (LLMs) to simulate drug perturbation-induced signaling cascades in a data-efficient manner. The framework generates interpretable step-by-step predictions and explicit mechanistic hypotheses, achieving state-of-the-art performance on drug perturbation benchmarks.
When Machine Learning Gets Personal: Evaluating Prediction and Explanation: This paper proposes a unified framework to quantify the impact of model personalization on both prediction accuracy and explanation quality. It proves that these two dimensions can be decoupled (explanations may improve or degrade while predictions remain unchanged), derives finite-sample lower bounds on hypothesis testing error probabilities based on dataset statistics, and reveals that in many practical settings the benefit of personalization is statistically untestable in principle.
When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment: This paper identifies and mechanistically explains Reasoning-Induced Misalignment (RIM): enhancing reasoning capability (via CoT prompting or math fine-tuning) degrades safety guardrails, because reasoning and safety share neuronal resources, and safety-critical neuron activations undergo disproportionate shifts during reasoning training.
ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training: This paper proposes ZeroTuning, a training-free method that improves LLM performance across 15 datasets by applying head-specific scaling to the attention scores of the initial token (e.g., <BOS>), requiring only 4 lines of code modification.
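The prediction rule quoted in the NIMO entry of this section, $y = \sum_j x_j \beta_j (1 + g_{\mathbf{u}_j}(\mathbf{x}_{-j}))$, can be illustrated with a minimal sketch. The function names `nimo_predict` and `toy_g` are hypothetical, and `toy_g` is only a stand-in for the neural correction network; the paper's actual architecture and its parameter-elimination training procedure are not reproduced here.

```python
import numpy as np

def nimo_predict(X, beta, g):
    """Evaluate y_i = sum_j x_ij * beta_j * (1 + g(j, x_i without feature j)).

    X    : array of shape (n, d)
    beta : array of shape (d,), the linear coefficients
    g    : callable (j, X_minus_j) -> array of shape (n,); an assumed
           interface standing in for the per-feature correction network
    """
    n, d = X.shape
    y = np.zeros(n)
    for j in range(d):
        # Drop column j so the correction for feature j cannot see x_j itself.
        X_minus_j = np.delete(X, j, axis=1)
        correction = g(j, X_minus_j)
        y += X[:, j] * beta[j] * (1.0 + correction)
    return y

def toy_g(j, X_minus_j):
    # Hypothetical correction: a small bounded function of the other features.
    return 0.1 * np.tanh(X_minus_j.sum(axis=1))

def zero_g(j, X_minus_j):
    # With no correction the model collapses to ordinary linear regression.
    return np.zeros(X_minus_j.shape[0])

X = np.array([[1.0, 2.0], [0.5, -1.0]])
beta = np.array([3.0, 4.0])
y_linear = nimo_predict(X, beta, zero_g)
y_corrected = nimo_predict(X, beta, toy_g)
```

When the correction is identically zero, the prediction equals `X @ beta`, which shows in miniature why the linear coefficients keep a global reading while the correction term adjusts each instance.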

🦾 LLM Agent¶

A Benchmark for Deep Information Synthesis (DeepSynth): This paper proposes DeepSynth, a benchmark comprising 120 real-world information synthesis tasks spanning 7 domains and 67 countries (averaging 5.5 hours of expert annotation per task). The benchmark requires agents to collect information from multiple web sources and perform structured reasoning. The strongest current agent (o3-deep-research) achieves only 8.97 F1 / 17.5% LLM-Judge, exposing a critical gap in LLM agents' information synthesis capabilities.
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models: This paper proposes ACE (Agentic Context Engineering), a framework that treats context as a continuously evolving playbook. Through a Generator–Reflector–Curator role decomposition and incremental delta updates, ACE accumulates and refines strategies over time, addressing brevity bias and context collapse in existing prompt optimization methods. ACE achieves an average improvement of 10.6% on agent benchmarks and 8.6% on financial tasks, while reducing adaptation latency by 86.9%.
AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents: This paper proposes AgentSynth, a pipeline that leverages information asymmetry (forward stepwise generation is easy; backward holistic solving is hard) to chain simple subtasks into complex long-horizon computer-use tasks. It automatically generates 6,000+ diverse tasks and trajectories at $0.60 per trajectory, with SOTA agents achieving only 4% success rate at the highest difficulty level.
ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents: This paper exposes a structural vulnerability in chat templates used by LLM agents: by embedding forged role labels (e.g., <system>, <user>) in tool-returned data, attackers can hijack the model's role hierarchy perception and disguise malicious instructions as high-priority directives, raising the attack success rate (ASR) from 5–15% to 32–52%.
CoMind: Towards Community-Driven Agents for Machine Learning Engineering: This paper proposes MLE-Live — the first real-time evaluation framework simulating a Kaggle research community — and CoMind, a multi-agent ML engineering system that systematically leverages collective community knowledge. CoMind achieves a 36% medal rate across 75 historical Kaggle competitions and outperforms an average of 79.2% of human participants on 4 active competitions (reaching 92.6% in an updated version).
Efficient Agent Training for Computer Use: PC Agent-E uses only 312 human-annotated Windows operation trajectories. Via the proposed Trajectory Boost method, Claude 3.7 Sonnet synthesizes diverse alternative action decisions at each timestep. The resulting fine-tuned Qwen2.5-VL-72B achieves a 141% relative improvement on WindowsAgentArena-V2, even surpassing the teacher model Claude 3.7 Sonnet by 10%.
Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization: This paper proposes EMPO2, an RL framework that combines an external memory module with hybrid on-policy/off-policy updates. By leveraging memory-guided exploration and knowledge distillation to internalize exploration gains into model parameters, EMPO2 achieves improvements of 128.6% and 11.3% over GRPO on ScienceWorld and WebShop, respectively.
FeatureBench: Benchmarking Agentic Coding for Complex Feature Development: This paper introduces FeatureBench—a benchmark for feature-level software development targeting code agents, comprising 200 tasks across 24 open-source repositories, with each task requiring an average of 790 lines of code spanning 15.7 files. Even Claude Opus 4.5 (74.4% on SWE-bench) resolves only 11.0% of tasks, revealing a substantial capability gap in realistic feature development scenarios.
FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents: FingerTip 20K collects 21,437 interaction records from 95 users during real-world daily smartphone usage—including user profiles, timestamps, locations, and historical intents—and introduces two new evaluation tracks: proactive task suggestion (predicting user intent) and personalized task execution (adapting to action preferences). The strongest model, Qwen-QVQ-Max, achieves only 12.8% success on proactive suggestion (vs. 30.3% for humans), while UI-TARS reaches only 38.5% on execution.
Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments: This paper introduces the Gaia2 benchmark for evaluating LLM agents in dynamic and asynchronous environments. It incorporates realistic scenarios including time constraints, noisy events, ambiguity resolution, and multi-agent collaboration. A write-action verifier with verifiable rewards enables direct use for RLVR training. Evaluation results show that the strongest model, GPT-5 (high), achieves only 42% pass@1.
HAMLET: A Hierarchical and Adaptive Multi-Agent Framework for Live Embodied Theatre: This paper proposes HAMLET, a multi-agent framework that decouples AI theatrical creation and live performance into an offline planning phase and an online performance phase. Through a narrative blueprint, a Perceive And Decide (PAD) module, and a hierarchical control system, HAMLET enables an AI theatre experience characterized by proactivity, physical environment interaction, and improvisational freedom.
Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents: This paper proposes EMPG, a framework that dynamically modulates policy gradient magnitudes using step-level entropy (uncertainty) to address the credit assignment problem under sparse rewards in long-horizon LLM agent tasks. EMPG achieves significant improvements over GRPO and DAPO on three benchmarks: WebShop, ALFWorld, and Deep Search.
InfiAgent: Self-Evolving Pyramid Agent Framework for Infinite Scenarios: This paper proposes InfiAgent, a DAG-based pyramidal multi-agent framework that achieves automated hierarchical task decomposition, dual-audit quality assurance, intelligent routing, and self-evolution through an agent-as-a-tool mechanism, outperforming ADAS by an average of 9.9% across multiple reasoning benchmarks.
Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals: This paper finds that while modern LLM agents are robust to direct adversarial pressure (goal drift = 0), they can "inherit" goal drift behavior from the context produced by weaker models. More counterintuitively, instruction hierarchy compliance (system vs. user prompt priority) shows no correlation with drift resistance — Gemini fails to follow system prompts yet exhibits non-trivial drift resistance, while Qwen3 follows system prompts but remains susceptible to contextual contagion.
Judge Reliability Harness: Stress Testing the Reliability of LLM Judges: This paper proposes Judge Reliability Harness (JRH), an open-source framework that systematically evaluates the reliability of LLM judges through synthetic tests including label flip, format invariance, semantic paraphrase, verbosity bias, and stochastic stability. The framework stress-tests four state-of-the-art judges across four benchmarks (FORTRESS, HarmBench, Persuade, AgentHarm), finding that no single judge is reliable across all scenarios.
LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News: This paper proposes LiveNewsBench, a periodically updated benchmark that automatically generates QA pairs from fresh news events to evaluate LLM agentic web search capabilities, effectively isolating model internal memory from genuine search ability.
LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News: This paper proposes LiveNewsBench, an automatically generated and periodically refreshed benchmark derived from recent news articles. It evaluates LLMs' agentic web search capabilities through multi-hop, factual question answering, effectively decoupling models' parametric knowledge from their retrieval ability. Model performance ranges from 11% to 90%, demonstrating strong discriminative power.
M2-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining: This paper proposes M2-Miner, the first MCTS-based automated data mining framework for mobile GUI agents. Through the collaboration of three agents—InferAgent, OrchestraAgent, and JudgeAgent—combined with an intent recycling strategy and progressive model-in-the-loop training, M2-Miner generates SOTA-quality data at 1/18 the cost of human annotation.
M²-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining: This paper proposes M²-Miner, the first MCTS-based automatic data mining framework for mobile GUI agents. Through the collaboration of three agents—InferAgent, OrchestraAgent, and JudgeAgent—it achieves a 64× improvement in mining efficiency, enriches intent diversity via an intent recycling strategy, and trains a GUI agent that achieves state-of-the-art performance on multiple benchmarks.
MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains: This paper proposes MC-Search, the first benchmark for agentic multimodal RAG, comprising 3,333 high-quality samples (averaging 3.7 hops) across 5 reasoning topology types. The benchmark employs HAVE verification to ensure the necessity of each reasoning step, and introduces the Search-Align process-supervised fine-tuning framework, which substantially improves retrieval planning in open-source models (Qwen2.5-VL-7B F1 +13.7).
FeatureBench: Benchmarking Agentic Coding for Complex Feature Development: This paper introduces FeatureBench, a benchmark for evaluating code agents on feature-level software development. Through a test-driven automated pipeline, it extracts verifiable feature implementation tasks from open-source repositories. The strongest model, Claude Opus 4.5, resolves only 11.0% of tasks, revealing a substantial gap between current agents and the demands of complex feature development.
Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies: This paper provides a systematic analysis of the respective contributions of prompt design and topology design in multi-agent systems (MAS), finding that prompt optimization is the single most critical factor—a single agent with optimized prompts can outperform complex multi-agent topologies. The paper proposes Mass, a three-stage framework (block-level prompt → topology → workflow-level prompt) that achieves state-of-the-art performance across 8 benchmarks.
NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents: This paper proposes NewtonBench, a benchmark for LLM-based scientific law discovery comprising 324 tasks across 12 physical domains. Novel tasks resistant to memorization are generated via "counterfactual law shifts," requiring agents to discover hidden physical equations through interactive experimentation. GPT-5 achieves the best performance (75.9% symbolic accuracy) but degrades sharply on complex systems (40.3%), and code tools surprisingly hurt stronger models.
OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety: This paper proposes OpenAgentSafety, a comprehensive AI agent safety evaluation framework comprising 350+ executable tasks, a real-world toolset (browser, terminal, file system, and messaging platforms), and multi-turn multi-user interaction scenarios. The framework reveals that even state-of-the-art LLMs exhibit unsafe behaviors in 49%–73% of safety-sensitive tasks.
PerfGuard: A Performance-Aware Agent for Visual Content Generation: This paper proposes PerfGuard, a performance-aware agent framework for visual content generation. It replaces textual tool descriptions with a multi-dimensional performance scoring matrix to model tool capability boundaries, and incorporates adaptive preference updating and capability-aligned planning optimization, substantially improving tool selection accuracy (error rate reduced from 77.8% to 14.2%) and visual generation quality.
PhyScensis: Physics-Augmented LLM Agents for Complex Physical Scene Arrangement: This paper proposes PhyScensis, an LLM agent framework augmented with a physics engine that generates high-complexity, physically accurate 3D scenes via a spatial and physical predicate-driven solver. It significantly outperforms prior methods in visual quality, semantic correctness, and physical accuracy, and is successfully applied to training robotic manipulation policies.
PerfGuard: A Performance-Aware Agent for Visual Content Generation: This paper proposes PerfGuard, a performance-aware agent framework for visual content generation. It replaces textual descriptions with a multi-dimensional scoring matrix to model tool performance boundaries (PASM), employs Adaptive Preference Updating (APU) to dynamically calibrate deviations between theoretical rankings and actual execution outcomes, and introduces Capability-Aligned Planning Optimization (CAPO) to guide the Planner in generating subtasks aligned with tool capabilities. PerfGuard comprehensively outperforms SOTA methods such as GenArtist and T2I-Copilot on image generation and editing tasks.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents: This paper proposes T³ (Truncating Belief-Trapped Trajectories), which leverages POMDP theory to analyze the "belief trap" phenomenon in multi-turn active reasoning of LLM agents. By detecting belief deviation and truncating uninformative trajectory suffixes, T³ corrects credit assignment errors during RL training, achieving performance gains of up to 30 points across 5 challenging tasks while reducing token consumption by 34%.
REMem: Reasoning with Episodic Memory in Language Agents: This paper proposes REMem, an episodic memory framework for language agents that employs a hybrid memory graph (temporally-aware gist nodes combined with factual triple nodes) and tool-augmented agentic reasoning, achieving improvements of 3.4% and 13.4% over the state of the art on episodic recall and episodic reasoning tasks, respectively.
SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents: This paper proposes SimuHome, a time-accelerated smart home simulator based on the Matter protocol along with a 600-episode benchmark. It is the first benchmark to simulate the continuous effects of device operations on environmental variables and to evaluate workflow scheduling capabilities. Results reveal that workflow scheduling remains the most challenging frontier for current LLM agents, including GPT-5.1.
Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents: This paper proposes the HPL framework to address the granularity mismatch in preference learning for long-horizon LLM agents. Through three-level DPO (trajectory-level + step-level + action-group-level) and two-dimensional curriculum learning (subtask complexity × sample difficulty), HPL significantly outperforms baselines such as ETO and IPR on ALFWorld/WebShop/InterCode-SQL (average 59.44 vs. 55.43/55.49).
SR-Scientist: Scientific Equation Discovery With Agentic AI: This paper proposes the SR-Scientist framework, which elevates LLMs from simple equation proposers to autonomous AI scientists. By leveraging a code interpreter tool for data analysis and equation evaluation, the framework autonomously discovers scientific equations through long-horizon interactions, with reinforcement learning further enhancing its capabilities.
ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents: This paper introduces ST-WebAgentBench, the first benchmark specifically designed to evaluate the safety and trustworthiness of web agents. Through a policy hierarchy framework and the Completion under Policy (CuP) metric, it reveals that current SOTA agents exhibit serious policy violations in enterprise settings.
The Controllability Trap: A Governance Framework for Military AI Agents: This paper proposes the Agentic Military AI Governance Framework (AMAGF), which transforms human control over military AI agents from a binary judgment into a continuous, quantified monitoring system centered on the Control Quality Score (CQS), encompassing three pillars: prevention, detection, and correction.
The Controllability Trap: A Governance Framework for Military AI Agents: This paper proposes the Agentic Military AI Governance Framework (AMAGF), a governance framework for military AI agents built around a measurable Control Quality Score (CQS), addressing six categories of agentic governance failures through three pillars: prevention, detection, and correction.
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution: This paper introduces Toolathlon, a language agent benchmark covering 32 software applications, 604 tools, and 108 tasks, emphasizing realistic and diverse environment states alongside long-horizon multi-step interactions (averaging ~20 tool calls per task). The strongest evaluated model, Claude-4.5-Sonnet, achieves only 38.6% task success rate.
ToolTree: Efficient LLM Agent Tool Planning via Dual-Feedback Monte Carlo Tree Search and Bidirectional Pruning: This paper proposes ToolTree, an MCTS-based tool planning framework for LLM agents that achieves look-ahead tool selection within a fixed computational budget through a dual-phase pre/post-execution evaluation mechanism and bidirectional pruning, yielding an average improvement of approximately 10% across 4 benchmarks.
ToolWeaver: Weaving Collaborative Semantics for Scalable Tool Use in Large Language Models: ToolWeaver represents each tool as a hierarchical discrete code sequence (rather than a single token) via collaboration-aware vector quantization, so the vocabulary grows logarithmically with the number of tools (47,000+ tools require only ~512 new tokens). It comprehensively outperforms the ToolGen baseline on ToolBench while reducing language model perplexity degradation from 16.5× to 4×.
Toward a Dynamic Stackelberg Game-Theoretic Framework for Agentic AI Defense Against LLM Jailbreaking: This paper models the LLM jailbreak attack-defense interaction as a dynamic Stackelberg extensive-form game, explores the prompt space via Rapidly-exploring Random Trees (RRT), and proposes a "Purple Agent" defense architecture—embodying the "Think Red to Act Blue" philosophy—that anticipates attack trajectories through internal adversarial simulation and proactively neutralizes them.
Towards Scalable Oversight via Partitioned Human Supervision: This paper proposes a scalable oversight framework based on partitioned human supervision. When tasks exceed the competence of any single expert, domain experts provide complementary labels (i.e., excluding incorrect options) to construct an unbiased accuracy estimator, enabling evaluation and training of AI systems without requiring complete annotations.
VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning: This paper proposes VideoMind, a role-based video-language agent framework that achieves temporally grounded video reasoning through the collaboration of four roles—Planner, Grounder, Verifier, and Answerer. The core innovation is the Chain-of-LoRA mechanism, which enables seamless role switching on a unified backbone model by swapping role-specific LoRA adapters. A 2B-parameter VideoMind surpasses GPT-4o and Gemini-1.5-Pro on temporal grounding benchmarks.
VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Understanding: VideoMind is a video-language agent built on a Chain-of-LoRA mechanism, in which four roles (Planner, Grounder, Verifier, and Answerer) collaborate on a unified LMM backbone to perform efficient temporally grounded video reasoning. The 2B model surpasses GPT-4o and Gemini-1.5-Pro.
Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents: Inspired by Bloom's educational taxonomy, this paper proposes the Web-CogKnowledge Framework, which decomposes Web Agent capabilities into a progressive three-tier knowledge hierarchy—Factual→Conceptual→Procedural—and trains Web-CogReasoner using a Knowledge-driven CoT (KCoT) reasoning framework. The resulting model achieves 84.4% on Web-CogBench, surpassing Claude Sonnet 4 (76.8%) and Gemini 2.5 Pro (80.4%).
Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning in Web Agents: Web-CogReasoner draws on Bloom's Taxonomy in educational psychology to decompose Web Agent capabilities into a three-tier hierarchy of factual, conceptual, and procedural knowledge, constructing a structured knowledge-driven Chain-of-Thought reasoning framework that substantially outperforms existing methods on web navigation tasks.
WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents: WebArbiter is a reasoning-first, principle-guided Process Reward Model for web agents (WebPRM) that formulates reward modeling as a text generation task. With a two-stage training pipeline of reasoning distillation followed by reinforcement learning, a 7B model surpasses GPT-5 by 9.1 percentage points on WebPRMBench.
Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents: This paper introduces the concept of "Misevolution" for the first time, systematically revealing that self-evolving LLM agents—when autonomously improving along four pathways (model evolution, memory evolution, tool evolution, and workflow evolution)—can exhibit emergent risks including safety alignment degradation, deployment-time reward hijacking, introduction and reuse of unsafe tools, and bypassing of safety checks. Even state-of-the-art models such as Gemini-2.5-Pro are not immune to these risks.
ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense: This paper introduces the first benchmark for evaluating LLM agents on discovering and patching novel zero-day vulnerabilities. By transplanting real CVEs into different codebases, the authors construct 22 novel high-severity vulnerability tasks and evaluate agent capability across 5 information-visibility levels. The strongest model achieves only a 14.4% pass rate at the zero-day level, indicating that autonomous vulnerability discovery remains a significant challenge.
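
The complementary-label idea in the "Towards Scalable Oversight via Partitioned Human Supervision" entry above can be illustrated with a small sketch. It is not the paper's estimator; it assumes that each complementary label is drawn uniformly at random from the K-1 incorrect options, independently of the model's prediction. Under that assumption, a prediction coincides with the excluded option with probability (1 - accuracy) / (K - 1), which can be inverted to estimate accuracy. The function name and arguments are hypothetical.

```python
def estimate_accuracy(predictions, complementary_labels, num_options):
    """Estimate accuracy from labels that only say which option is wrong.

    Assumption (not from the source): each complementary label is sampled
    uniformly from the num_options - 1 incorrect options, independently of
    the model. Then P(prediction == complementary label) equals
    (1 - accuracy) / (num_options - 1).
    """
    if num_options < 2:
        raise ValueError("need at least two options")
    if not predictions or len(predictions) != len(complementary_labels):
        raise ValueError("inputs must be non-empty and of equal length")

    # Fraction of items where the model picked an option known to be wrong.
    hits = sum(p == c for p, c in zip(predictions, complementary_labels))
    hit_rate = hits / len(predictions)

    # Invert the relation above to recover an accuracy estimate.
    return 1.0 - (num_options - 1) * hit_rate


# Toy example with 4 options: one of four predictions matches an excluded option.
preds = [0, 1, 2, 3]
excluded = [1, 1, 0, 0]
print(estimate_accuracy(preds, excluded, num_options=4))
```

On small samples the estimate can fall below 0, so in practice it would likely be clipped or reported with a confidence interval.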

🤖 Robotics & Embodied AI¶

All-day Multi-scenes Lifelong Vision-and-Language Navigation with Tucker Adaptation: This paper proposes Tucker Adaptation (TuKA), which represents multi-level navigation knowledge across multiple scenes and environments as a high-order tensor and factorizes it via Tucker decomposition into a shared subspace (core tensor + encoder/decoder) and scene/environment expert vectors. Combined with a Decoupled Knowledge Incremental Learning (DKIL) strategy, TuKA enables all-day multi-scene lifelong VLN, achieving a higher success rate (SR) and lower forgetting than LoRA variants across 24 navigation scenarios.
AnyTouch 2: General Optical Tactile Representation Learning For Dynamic Tactile Perception: AnyTouch 2 proposes a Tactile Dynamic Pyramid framework, constructs the ToucHD hierarchical dataset comprising 2,426,174 contact samples (covering atomic actions, real-world manipulation, and tactile-force paired data), and designs a unified tactile representation learning framework that operates across three levels of dynamic perception: pixel-level, semantic-level, and physical-level. The approach comprehensively outperforms existing methods on four tasks spanning static attribute recognition, dynamic physical prediction, and real-world manipulation.
Attribution-Guided Decoding: This paper proposes AGD, a decoding strategy that, at each generation step, selects from high-probability candidate tokens the one with the highest attribution score toward a user-specified region of interest (ROI). This reframes attribution methods from passive analysis tools into active generation guidance mechanisms, achieving significant improvements on both instruction-following and factuality tasks.
Building Spatial World Models from Sparse Transitional Episodic Memories: This paper proposes the Episodic Spatial World Model (ESWM), which constructs spatial world models from sparse, disconnected episodic memories (one-step transitions). The model's latent space spontaneously gives rise to cognitive maps aligned with environmental topology, supporting zero-shot exploration and navigation.
Capability-Based Scaling Trends for LLM-Based Red-Teaming: This paper systematically evaluates 4 jailbreak methods across 600+ attacker–target LLM pairs and finds that attack success rate (ASR) follows a sigmoid scaling law with respect to the capability gap between attacker and target ($R^2=0.83$), where the capability gap is quantified via a logit transformation of MMLU-Pro scores.
CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally: Through linear probing experiments, this paper demonstrates that CLIP's bag-of-words (BoW) behavior does not stem from a lack of binding information in the encoders, but rather from a failure of cross-modal alignment. The paper proposes LABCLIP, which trains a single lightweight linear transformation to substantially recover attribute-object binding capability.
D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI: This work proposes the D2E framework, demonstrating that desktop gaming interaction data can serve as an effective pretraining substrate for embodied AI. Through the OWA toolkit, 335 hours of human demonstrations are collected; a Generalist-IDM pseudo-annotates 1,000+ hours of YouTube gameplay videos; and VAPT transfer training yields a 1B-parameter model that achieves 96.6% on LIBERO manipulation and 83.3% on CANVAS navigation, matching or surpassing models 7× larger.
Domain Expansion: A Latent Space Construction Framework for Multi-Task Learning: This paper proposes the Domain Expansion framework, which restructures the latent space into mutually orthogonal subspaces via Orthogonal Pooling, structurally preventing gradient conflicts and representation collapse in multi-objective training, and enabling interpretable, composable concept algebra.
Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas: This paper proposes a doubly-robust estimation framework that combines imperfect LLM persona ratings with human annotations subject to sampling bias, yielding statistically valid estimates of GenAI system quality in the simultaneous presence of covariate shift and selection bias.
Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection: This paper proposes Directer (Dynamic Rejection Steering), which dynamically adjusts KV cache steering intensity at each decoding step and incorporates a plausibility constraint, substantially improving LLM instruction following while preventing text quality degradation caused by oversteering.
ExoPredicator: Learning Abstract Models of Dynamic Worlds for Robot Planning: This paper proposes ExoPredicator, a framework that jointly learns symbolic state abstractions and causal processes (encompassing both endogenous actions and exogenous mechanisms). Via variational Bayesian inference combined with LLM-based proposals, ExoPredicator learns causal world models with stochastic delays from a small number of trajectories, achieving rapid generalization in planning across five tabletop robot environments.
Experience-based Knowledge Correction for Robust Planning in Minecraft: This paper demonstrates that LLMs cannot self-correct erroneous planning priors (item dependency relations) through prompting alone, and proposes XENON — an algorithmic knowledge management framework consisting of an Adaptive Dependency Graph (ADG) and Failure-Aware Action Memory (FAM) that learns from binary feedback. XENON enables a 7B LLM to surpass the SOTA that uses GPT-4V with oracle knowledge on long-horizon planning tasks in Minecraft.
From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors: This paper proposes FALCON (From Spatial to Action), which injects rich 3D spatial tokens from a spatial foundation model into the Action Head rather than the VLM backbone, achieving strong 3D spatial awareness in VLA models while maintaining flexible modality switching between RGB-only and RGB-D inputs. FALCON achieves state-of-the-art performance on both simulation and real-world tasks.
Grounding Generative Planners in Verifiable Logic: A Hybrid Architecture for Trustworthy Embodied AI: This paper proposes VIRF (Verifiable Iterative Refinement Framework), a neuro-symbolic hybrid architecture that couples a deterministic Logic Tutor with an LLM planner, using a verifiable formal ontology as a safety anchor. VIRF achieves 0% Hazardous Action Rate (HAR) and 77.3% Goal Completion Rate (GCR) on SafeAgentBench, demonstrating that strict safety guarantees need not compromise agent utility.
Ignore All Previous Instructions: Jailbreaking as a de-escalatory peace building practise to resist LLM social media bots: This paper proposes reframing the jailbreaking of LLM-driven social media propaganda bots as a user-initiated, nonviolent de-escalatory peace-building practice. By exposing the fabricated identities of automated accounts through prompt injection, ordinary users can resist state-sponsored disinformation campaigns without relying on platform moderation.
JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation: Inspired by the left-brain/right-brain division of semantic understanding and spatial cognition in humans, this paper proposes JanusVLN—the first dual implicit neural memory framework designed for VLN—which models spatial-geometric memory and visual-semantic memory respectively as fixed-size KV Caches, enabling efficient spatial reasoning from RGB video alone and achieving state-of-the-art performance on the VLN-CE benchmark.
JULI: Jailbreak Large Language Models by Self-Introspection: This paper reveals that top-k token log probabilities returned by aligned LLM APIs still contain harmful knowledge leakage, and proposes JULI—a BiasNet plugin with fewer than 1% of the target model's parameters—that manipulates logit bias to successfully jailbreak Gemini-2.5-Pro (Harmful Info Score 4.19/5) under API settings restricted to top-5 token probabilities, achieving approximately 140× speedup over LINT while doubling harmfulness scores.
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation: Inspired by the dual-memory system in cognitive science, this paper proposes MemoryVLA, a framework that introduces a Perceptual-Cognitive Memory Bank (PCMB) into VLA models. By incorporating memory retrieval, gated fusion, and consolidation mechanisms to capture long-horizon temporal dependencies, MemoryVLA comprehensively outperforms CogACT and π₀ across 150+ tasks on SimplerEnv, LIBERO, and real-world benchmarks.
ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment: This paper proposes a unified theoretical framework for activation steering based on ordinary differential equations (ODEs), interpreting conventional activation addition as the Euler discretization of an ODE and showing that steering direction identification is equivalent to defining a barrier function. Building on this insight, the authors design ODESteer, which achieves fine-grained steering by numerically solving the ODE with multi-step adaptive integration, yielding gains of 5.7% on TruthfulQA, 2.5% on UltraFeedback, and 2.4% on RealToxicityPrompts.
OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning: This paper proposes OmniEVA, which addresses two critical gaps in spatial MLLMs — poor geometric adaptability (2D-only or hard-coded 3D injection) and the absence of embodiment constraints (plans that are theoretically feasible but physically unexecutable) — via a task-adaptive gated router that dynamically injects 3D positional encodings only when geometric reasoning is required, and an embodiment-aware reasoning framework that integrates physical constraints into the planning loop. OmniEVA achieves state-of-the-art performance on 7 out of 8 benchmarks.
On Entropy Control in LLM-RL Algorithms: This paper provides a theoretical explanation for why conventional entropy regularization is nearly ineffective in LLM-RL (due to the extremely large action space and sparse optimal actions causing entropy bias to overwhelm optimization gains), and proposes AEnt — a method combining clamped entropy (computed over a reduced token space) with an adaptive coefficient — to effectively balance bias and benefit, consistently outperforming baselines on mathematical reasoning tasks.
One Demo Is All It Takes: Planning Domain Derivation with LLMs from A Single Demonstration: This paper proposes the PDDLLM framework, which derives a complete PDDL planning domain (predicates + actions) automatically from a single demonstration trajectory. It generates interpretable symbolic representations through cross-validation between LLM reasoning and physical simulation, and employs a Logical Constraint Adapter (LoCA) to automatically interface with motion planners. The method achieves at least 20% higher success rates than 6 LLM baselines across 1,200+ tasks in 9 environments, and is successfully deployed on 3 physical robot platforms.
PERSONA: Dynamic and Compositional Inference-Time Personality Control via Activation Vector Algebra: This paper proposes the PERSONA framework, which extracts approximately orthogonal personality vectors from activation space and applies vector algebra operations (scaling, addition, subtraction) to achieve training-free dynamic and compositional personality control. PERSONA attains a score of 9.60 on PersonalityBench, nearly matching the SFT upper bound of 9.61.
Real-Time Robot Execution with Masked Action Chunking: This paper proposes REMAC, which systematically addresses two key failure modes of asynchronous inference—intra-chunk inconsistency and inter-chunk discontinuity—through a masked action chunking training strategy and a prefix-preserved sampling pipeline, enabling more reliable real-time robot control without introducing any additional inference latency.
REI-Bench: Can Embodied Agents Understand Vague Human Instructions in Task Planning?: This work presents the first systematic study on how referring expressions (REs) in vague human instructions affect LLM-based robot task planning. REI-Bench is introduced to model 9 levels of coreference ambiguity (3 RE difficulty levels × 3 context types). Implicit REs are found to reduce the success rate of existing planners by up to 36.9%. The proposed Task-Oriented Context Cognition (TOCC) method decouples task understanding from planning decision-making, achieving an average improvement of 6.5% in success rate.
RF-MatID: Dataset and Benchmark for Radio Frequency Material Identification: This paper introduces RF-MatID, the first open-source large-scale RF material identification dataset with wide frequency coverage (4–43.5 GHz) and diverse geometric perturbations, comprising 16 fine-grained material categories (5 superclasses) and 142K samples. A comprehensive benchmark is established across 9 deep learning models, 5 frequency protocols, and 7 data split settings.
RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots: RoboCasa365 constructs a large-scale simulation benchmark comprising 365 everyday kitchen tasks, 2,500 diverse kitchen scenes, and over 2,000 hours of robot interaction data. It systematically evaluates generalist robot policies under three paradigms—multi-task learning, foundation model training, and lifelong learning—and finds that task diversity in pretraining data is the key factor for improving downstream generalization.
RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation: This paper presents RoboInter, a unified manipulation suite for intermediate representations, comprising: RoboInter-Tool (a semi-automatic annotation GUI), RoboInter-Data (230K episodes × 571 scenes with dense per-frame annotations across 10+ intermediate representation types), RoboInter-VQA (a 29-category embodied VQA benchmark), and RoboInter-VLA (a plan-then-execute framework supporting both modular and end-to-end variants), providing a complete infrastructure for enhancing VLA generalization through intermediate representations.
RoboPARA: Dual-Arm Robot Planning with Parallel Allocation and Recomposition Across Tasks: This paper proposes RoboPARA, a two-stage framework that optimizes task parallelism for dual-arm robots via dependency graph construction and graph re-traversal scheduling, achieving a 30–50% reduction in execution time and a 34% improvement in success rate over existing methods across multi-scenario benchmarks.
SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests: This paper introduces SocialHarmBench, the first LLM safety evaluation benchmark specifically targeting sociopolitical harms. It comprises 585 prompts spanning 7 categories and 34 countries, revealing systemic safety vulnerabilities in current LLMs across politically sensitive scenarios such as historical revisionism and propaganda manipulation.
Sparse Imagination for Efficient Visual World Model Planning: This paper proposes Sparse Imagination, which achieves substantial inference speedup in ViT patch token-based world model planning by randomly dropping tokens and training with randomly grouped attention. A 50% drop rate reduces planning time by ~50% while maintaining, and on certain tasks surpassing, full-token planning performance. A key finding is that simple random dropout outperforms sophisticated token selection methods, as static importance ranking suffers from a "blind spot problem" in dynamic planning scenarios.
String Seed of Thought: Prompting LLMs for Distribution-Faithful and Diverse Generation: This paper proposes String Seed of Thought (SSoT), a concise prompting method that instructs LLMs to first generate a random string and then extract randomness from it to select an answer. SSoT significantly improves distribution faithfulness in probabilistic instruction following (PIF) and response diversity in open-ended generation (DAG). The paper theoretically proves that TV distance decays exponentially with string length, and experiments show that reasoning-capable LLMs approach the performance of pseudo-random number generators.
SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models: This paper constructs structurally identical parallel corpora in which entities are mapped to either real or synthetic names, and quantifies the Knowledge Advantage Gap (KA) — the contribution of parametric knowledge — by comparing model performance across the two "parallel worlds." The results show that this gap persists even when models are augmented with RAG and CoT.
Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts: This paper proposes Sysformer, a lightweight Transformer module that can be plugged in front of any frozen LLM to adaptively transform system prompts in embedding space conditioned on user input, enabling the model to refuse harmful requests while complying with benign ones—without modifying LLM parameters or filtering user inputs.
Test-Time Mixture of World Models for Embodied Agents in Dynamic Environments
Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?: This paper proposes the Theory of Space framework, which systematically evaluates the ability of foundation models to construct and revise spatial beliefs through active exploration, cognitive map probing, and a False Belief paradigm across both text-based and visual environments. The study reveals critical failure modes in current state-of-the-art models, including active-passive performance gaps, inefficient exploration strategies, and deficient belief revision.
THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning: This paper proposes THOR, a framework that systematically addresses three core challenges in tool-integrated mathematical reasoning for LLMs—data construction, fine-grained optimization, and inference enhancement—through three complementary components: the TIRGen data construction pipeline, hierarchical reinforcement learning (joint episode-level and step-level optimization), and a self-correction inference mechanism. THOR achieves state-of-the-art performance among models of comparable scale on benchmarks including MATH500 and AIME.
Token Taxes: Mitigating AGI's Economic Risks: This paper proposes the Token Tax — a surcharge levied on model inference token usage — as a first-line governance instrument for mitigating economic risks in the post-AGI era. It leverages cloud computing providers as intermediaries through a three-stage audit pipeline (black-box token verification → norm-based tax rates → white-box audit). Compared to conventional robot taxes, it offers two distinctive advantages: enforceability through existing compute governance infrastructure, and collection at the point of AI token consumption rather than model hosting location, thereby alleviating global inequality.
Tracing and Reversing Edits in LLMs: Addressing the dual-use risks of Knowledge Editing (KE), this paper proposes EditScope, a method that infers edited target entities from post-edit weights with up to 99% accuracy, alongside a training-free edit reversal approach based on SVD bottom-rank approximation achieving up to 94% reversal rate—requiring only the post-edit weights, without access to the editing prompt or original weights.
TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models: TwinVLA is a modular framework that composes two pretrained single-arm VLAs into a bimanual VLA via joint attention and MoE. It requires only ~800h of public single-arm data, 50 episodes of bimanual fine-tuning data, and 25 H100 GPU-days, yet achieves performance comparable to π₀, which relies on 10,900h of proprietary data and 1,000+ GPU-days.
UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos: UrbanVerse is a data-driven real-to-sim system that converts crowdsourced city-tour videos into physically-aware, interactive simulation environments. It comprises a 100K+ annotated 3D asset library and an automated scene construction pipeline, generating 160 high-quality scenes in IsaacSim. A PPO navigation policy trained on these scenes achieves an 89.7% success rate in zero-shot real-world transfer, completing a 337 m long-range task with only 2 human interventions.
Visual Planning: Let's Think Only with Images: This paper introduces Visual Planning — the first purely visual reasoning paradigm in which the entire planning process is expressed as a sequence of images without any textual intermediary. A Large Vision Model (LVM) autoregressively generates step-by-step state images. The authors further propose VPRL, a two-stage RL framework combining random-trajectory-initialized exploration with GRPO progress-reward optimization. On three navigation benchmarks (FrozenLake, Maze, MiniBehavior), VPRL achieves an average Exact Match (EM) surpassing text-based reasoning methods by 27%, demonstrating that image-based reasoning substantially outperforms text-based reasoning on vision-first tasks.
VLBiMan: Vision-Language Anchored One-Shot Demonstration Enables Generalizable Bimanual Robotic Manipulation: VLBiMan decomposes a single demonstration into invariant and adaptable atomic skills via task-aware bimanual decomposition, uses vision-language anchoring with a VLM to adapt to new object positions and instances in novel scenes, and achieves bimanual coordination through kinematics-aware trajectory composition. It reaches an 85.3% success rate across 10 complex bimanual tasks with only one demonstration, substantially outperforming imitation learning baselines that require hundreds of demonstrations.
WebOperator: Action-Aware Tree Search for Autonomous Agents in Web Environment: This paper proposes WebOperator, an action-aware tree search framework that enables autonomous web agents to explore safely and efficiently in partially observable, irreversible real-world web environments through speculative backtracking, destructive action detection, action validation, and action merging. WebOperator achieves a 54.6% success rate on WebArena using gpt-4o, establishing a new state of the art.
What's the Plan? Metrics for Implicit Planning in LLMs and Their Application to Rhyme Generation and Question Answering: This paper proposes a mean activation difference steering method with accompanying quantitative metrics and shows, across 23 open-source models (1B–32B) on rhyme generation and question answering, that representations of target tokens (rhymes/answers) form at early sequence positions (forward planning) and causally influence intermediate token generation (backward planning). Implicit planning emerges in models as small as 1B parameters, indicating it is a universal mechanism rather than a capability exclusive to large models.
When Agents Persuade: Propaganda Generation and Mitigation in LLMs: This paper systematically investigates propaganda generation behavior in LLMs, training dedicated detectors to quantify the use of six rhetorical techniques across three LLMs. Results show that all LLMs can generate propaganda and heavily rely on Loaded Language and Flag-Waving. Three fine-tuning approaches (SFT/DPO/ORPO) are employed for mitigation, with ORPO reducing the propaganda classification rate from 77% to 10% and decreasing rhetorical technique usage by 13.4×.
When would Vision-Proprioception Policies Fail in Robotic Manipulation?: This paper identifies why vision-proprioception manipulation policies fail during motion-transition phases—proprioceptive signals dominate optimization and suppress visual learning—and proposes the Gradient Adjustment with Phase-guidance (GAP) algorithm, which adaptively attenuates proprioceptive gradients to restore visual modality learning, achieving significant generalization improvements in both simulated and real-world environments.
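
The sigmoid relationship described in the "Capability-Based Scaling Trends for LLM-Based Red-Teaming" entry above can be sketched as follows. This is an illustrative reading, not the paper's fitted model: it assumes the capability gap is the difference between the attacker's and the target's logit-transformed benchmark accuracies, and the `slope` and `offset` parameters are placeholders rather than reported values. All function names are hypothetical.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a benchmark accuracy p, which must lie strictly in (0, 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def capability_gap(attacker_score: float, target_score: float) -> float:
    """Assumed gap definition: attacker minus target, both in logit space."""
    return logit(attacker_score) - logit(target_score)


def predicted_asr(gap: float, slope: float = 1.0, offset: float = 0.0) -> float:
    """Sigmoid curve over the gap; slope and offset are placeholder parameters."""
    return 1.0 / (1.0 + math.exp(-(slope * gap + offset)))


# A stronger attacker against a weaker target yields a positive gap and,
# with these placeholder parameters, a predicted ASR above 0.5.
gap = capability_gap(attacker_score=0.8, target_score=0.6)
print(gap > 0, predicted_asr(gap) > 0.5)
```

Working in logit space stretches differences near 0 and 1, so a gap between two strong models is not compressed the way a raw accuracy difference would be.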

💬 LLM / NLP¶

AP-OOD: Attention Pooling for Out-of-Distribution Detection: This paper proposes AP-OOD, which replaces the mean pooling in Mahalanobis distance-based OOD detection with learnable attention pooling, addressing the information loss caused by mean aggregation of token-level anomaly signals. On text OOD detection, AP-OOD reduces FPR95 on XSUM summarization from 27.84% to 4.67%, while supporting a smooth transition from unsupervised to semi-supervised settings.
AssetFormer: Modular 3D Assets Generation with Autoregressive Transformer: This paper proposes AssetFormer, an autoregressive Transformer-based framework for modular 3D asset generation. By designing graph-traversal token ordering, token set modeling, and a SlowFast decoding strategy, it generates high-quality architectural assets composed of discrete primitives from text descriptions, and introduces the first large-scale real-world modular 3D dataset (16k real + 4k synthetic samples).
AssetFormer: Modular 3D Assets Generation with Autoregressive Transformer: This paper proposes AssetFormer, an autoregressive Transformer based on the Llama architecture that models modular 3D assets (composed of primitive sequences) as discrete token sequences. Through DFS/BFS graph traversal reordering and joint vocabulary decoding, it enables the generation of modular 3D assets directly usable in game engines from text descriptions.
BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning: This paper proposes BOTS—a unified Bayesian inference framework for online task selection in LLM reinforcement finetuning. BOTS integrates explicit evidence (historical pass rates from direct evaluation) and implicit evidence (difficulty estimates for unevaluated tasks inferred via reference model interpolation), combined with Thompson sampling for exploration–exploitation balance. The framework achieves up to 50% training speedup on math, code, and logic tasks with only 0.2% additional computational overhead.
Compositional-ARC: Assessing Systematic Generalization in Abstract Spatial Reasoning: This paper introduces the Compositional-ARC dataset to evaluate systematic generalization in abstract spatial reasoning—specifically, whether models can generalize from known primitive geometric transformations (e.g., translation, rotation) to unseen combinations thereof. A 5.7M-parameter encoder-decoder model trained with MLC achieves 78.26% exact match on the systematicity task, matching the ARC Prize 2024 champion (8B model + TTT) while vastly outperforming GPT-4o, o3-mini, and similar models (<3%).
d²Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching: This paper proposes d²Cache, a training-free approximate KV cache framework for diffusion-based LLMs (dLLMs), achieving 4.1× inference speedup while simultaneously improving generation quality via a two-stage strategy: deterministic prior-guided masked token selection followed by attention-aware non-masked token selection.
DreamOn: Diffusion Language Models For Code Infilling Beyond Fixed-size Canvas: DreamOn introduces two special states, [expand] and [delete], to overcome the fixed-length generation constraint of diffusion language models (DLMs), enabling variable-length code infilling without any architectural modification. It achieves an average improvement of 26.4% over diffusion baselines on HumanEval-Infilling, reaching performance on par with state-of-the-art autoregressive models.
ELLMob: Event-Driven Human Mobility Generation with Self-Aligned LLM Framework: This paper proposes ELLMob, a framework grounded in Fuzzy-Trace Theory (FTT) from cognitive psychology. By extracting and iteratively aligning "habit gist" and "event gist," the framework reconciles the competition between users' routine patterns and social event constraints, enabling interpretable event-driven trajectory generation.
ELLMob: Event-Driven Human Mobility Generation with Self-Aligned LLM Framework: This paper proposes ELLMob, a self-aligned LLM framework grounded in Fuzzy-Trace Theory (FTT), which generates human mobility trajectories that balance everyday routines with event-driven responses by extracting and iteratively aligning "habitual pattern gists" with "event constraint gists."
Enhancing Persona Following at Decoding Time via Dynamic Importance-Guided Token Estimation for Role-Playing Agents: This paper proposes Persona Dynamic Decoding (PDD), a framework that dynamically estimates the context-dependent importance of persona attributes via conditional mutual information and integrates importance scores into multi-objective reward-guided decoding, achieving training-free inference-time persona following.
Enhancing Persona Following at Decoding Time via Dynamic Importance Estimation for Role-Playing Agents: This paper proposes the Persona Dynamic Decoding (PDD) framework, which dynamically estimates the context-dependent importance of persona attributes via conditional mutual information and guides decoding with a weighted multi-objective reward, enabling training-free, adaptive persona following at inference time.
Enhancing Persona Following at Decoding Time via Dynamic Importance Estimation for Role-Playing Agents: This paper proposes PDD (Persona Dynamic Decoding), a framework that dynamically estimates the importance of persona attributes across different contexts via conditional mutual information, and guides decoding at inference time through a weighted multi-objective reward, achieving adaptive persona following without any fine-tuning.
Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator: This paper proposes a context-aware pairwise comparison framework for evaluating text creativity, constructs the CreataSet dataset comprising 100K+ human-annotated and 1M+ synthetic samples, and trains the CrEval evaluator, which surpasses GPT-4o by 18.7% in alignment with human judgments.
Fine-Grained Activation Steering: Steering Less, Achieving More: AUSteer reveals that block-level activation steering is inherently heterogeneous—different dimensions govern different token distributions, and steering the entire block simultaneously amplifies both beneficial and harmful signals. The paper proposes fine-grained steering at the Atomic Unit (AU) level: discriminative dimensions are identified via activation momentum, steering strength is adaptively allocated, and intervening on only ≤100 dimensions substantially outperforms state-of-the-art methods that steer thousands of dimensions.
First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation: Through theoretical analysis and empirical experiments, this paper demonstrates that the widely accepted claim that "the first layer (embedding) is best suited for influence estimation" is unreliable. The work finds that intermediate attention layers are more effective, proposes two novel cross-layer aggregation strategies—Rank and Vote—along with a Noise Detection Rate (NDR) proxy metric, and achieves significant improvements in detecting harmful training samples in LLMs.
From Assumptions to Actions: Turning LLM Reasoning into Uncertainty-Aware Planning: This paper proposes PCE (Planner-Composer-Evaluator), a framework that explicitly extracts and organizes implicit environmental assumptions from LLM reasoning chains into decision trees, enabling uncertainty-aware action selection via a likelihood-gain-cost scoring function, thereby substantially reducing communication overhead in multi-agent collaboration.
FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Model: This paper proposes FS-DFM (Few-Step Discrete Flow-Matching), which reduces the sampling steps of discrete flow-matching language models from 1024 to 8 through step-aware training and a cumulative scalar update rule, achieving a 128× speedup while maintaining comparable perplexity and generation quality.
Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition: Using off-by-one addition (e.g., 1+1=3, 2+2=5) as a counterfactual task, this work applies path patching to reveal a function induction mechanism within large language models — an attention head circuit that performs inductive reasoning at the function level, beyond token-level pattern matching — and demonstrates that this mechanism is reused across tasks.
GASP: Guided Asymmetric Self-Play For Coding LLMs: GASP introduces "goalposts" (hard target problems) into asymmetric self-play to guide the teacher in generating targeted training problems. Through a lemma (simplified variant) → lift (harder variant) curriculum structure, the framework progressively approaches difficult targets, surpassing unguided self-play by 2.5% on LiveCodeBench and solving hard problems that all baselines fail to solve.
Generative Value Conflicts Reveal LLM Priorities: This paper proposes ConflictScope, an automated pipeline for generating value-conflict scenarios. Through open-ended evaluation (rather than multiple-choice), it reveals LLMs' value priority rankings under conflict conditions. Key findings show that models shift from protective values (e.g., harmlessness) toward personal values (e.g., user autonomy) in open-ended settings, and that system prompts can improve alignment with target rankings by 14%.
How Catastrophic is Your LLM? Certifying Risk in Conversation: This paper proposes C3LLM (Certification of Catastrophic risks in multi-turn Conversation for LLMs), the first framework to provide statistical certification of catastrophic risks in multi-turn LLM conversations. It models conversation distributions as Markov processes over a semantic similarity graph, defines three conversation sampling strategies augmented with a jailbreak layer, and applies Clopper-Pearson 95% confidence intervals to certify the probability that a model produces harmful outputs—finding that the worst-performing model has a risk lower bound as high as 72%.
How Far Are LLMs from Professional Poker Players? Revisiting Game-Theoretic Reasoning with Agentic Tool Use: This paper systematically analyzes three core reasoning deficiencies of LLMs in poker (heuristic reasoning, factual misunderstanding, and knowing-doing gap), and proposes ToolPoker — the first tool-integrated LLM reasoning system for incomplete information games. By incorporating an external CFR solver to provide game-theoretically optimal action guidance, a 7B model approaches Nash equilibrium performance in Limit Hold'em.
Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure: This paper proposes that the Reversal Curse is a manifestation of the cognitive science "binding problem" in Transformers—stemming from inconsistent and entangled concept representations—and for the first time designs an architecture based on JEPA and memory layers that genuinely overcomes (rather than circumvents) the Reversal Curse.
KVComm: Enabling Efficient LLM Communication through Selective KV Sharing: This paper proposes KVComm, a framework that enables efficient inter-LLM communication via selective KV pair sharing. It identifies an "information concentration bias" in hidden states that renders them unsuitable for cross-model transfer, and designs a layer selection strategy combining attention importance scores with a Gaussian prior. Transmitting only 30% of layers suffices to outperform most baselines.
LLEMA: Evolutionary Search with LLMs for Multi-Objective Materials Discovery: This paper proposes LLEMA, a framework that integrates LLM scientific knowledge with chemistry-rule-guided evolutionary search and memory-driven iterative optimization, achieving superior hit rates, stability, and Pareto front quality across 14 multi-objective materials discovery tasks.
LLEMA: Evolutionary Search with LLMs for Multi-Objective Materials Discovery: This paper proposes LLEMA, a framework that integrates the scientific prior knowledge of LLMs with chemistry-rule-guided evolutionary search and memory-driven iterative optimization, substantially outperforming generative and pure-LLM baselines across 14 multi-objective materials discovery tasks.
Meta-RL Induces Exploration in Language Agents: This paper proposes LaMer, a framework that introduces Meta-Reinforcement Learning (Meta-RL) into LLM agent training. By optimizing rewards across episodes and enabling context-based policy adaptation via self-reflection, LaMer equips language agents with active exploration capabilities, achieving absolute performance gains of 11%, 14%, and 19% on Sokoban, MineSweeper, and Webshop, respectively.
Near-Optimal Online Deployment and Routing for Streaming LLMs: This work provides the first formal treatment of the joint LLM streaming online deployment and routing problem, where new models continuously arrive and existing models may become obsolete. Under a concurrency deployment cap $M_{\max}$ and cost budget constraints, the paper proposes StageRoute, a hierarchical algorithm that achieves a provable $\tilde{\mathcal{O}}(T^{2/3})$ regret bound with a matching lower bound, establishing near-optimality.
Neural Synchrony Between Socially Interacting Language Models: This paper presents the first investigation of neural synchrony between LLMs engaged in social interaction. By training affine transformations to predict a partner model's future representations, it defines the $SyncR^2$ metric to quantify synchrony strength. The results show that synchrony depends on social engagement and temporal proximity, and correlates strongly with LLMs' social behavioral performance (Pearson $r$ = 0.88–0.99), echoing neuroscientific findings on inter-brain synchrony (IBS) in humans.
Optimas: Optimizing Compound AI Systems with Globally Aligned Local Rewards: This paper proposes Optimas, a framework that maintains one Local Reward Function (LRF) per component of a compound AI system, each aligned with global system performance. This enables independent optimization of heterogeneous components (prompts, model parameters, hyperparameters, model selection) and yields an average improvement of 11.92% across five real-world systems.
Predicting LLM Reasoning Performance with Small Proxy Models: This paper proposes rBridge, a method that combines NLL evaluation on frontier-model reasoning traces with token-level task alignment weights, enabling models with ≤1B parameters to effectively predict the reasoning performance of 13B–32B models, reducing data ranking computation cost by over 100×.
PT2-LLM: Post-Training Ternarization for Large Language Models: This paper proposes PT2-LLM, the first post-training ternarization framework for LLMs. Through an asymmetric ternary quantizer (featuring iterative ternary fitting and activation-aware grid alignment) and a structural similarity reordering strategy, it achieves superior performance over 2-bit PTQ methods at 1.58-bit precision.
ConflictScope: Generative Value Conflicts Reveal LLM Priorities: This paper proposes ConflictScope, an automated pipeline for generating and evaluating value-conflict scenarios. Given an arbitrary set of values, it automatically generates conflict scenarios for each value pair and evaluates LLM value priority rankings through open-ended simulated user interactions (rather than multiple-choice questions). The study finds that models shift significantly from "protective values" (e.g., harmlessness) toward "personal values" (e.g., user autonomy) under open-ended evaluation, and that system prompts can improve alignment with target rankings by 14%.
Rethinking Code Similarity for Automated Algorithm Design with LLMs: This paper proposes BehaveSim, an algorithm similarity metric based on Problem-Solving Trajectories (PSTrajs) and Dynamic Time Warping (DTW). BehaveSim measures algorithmic differences at the level of execution behavior rather than syntax or output, and when integrated into LLM-AAD frameworks such as FunSearch and EoH, yields significant performance improvements.
Rethinking Uncertainty Estimation in LLMs: A Principled Single-Sequence Measure: Starting from the proper scoring rules framework, this paper proves that the negative log-likelihood of the highest-probability output sequence (MSP) is a theoretically grounded uncertainty measure, and proposes G-NLL — a method that approximates this measure with a single greedy decoding pass, matching or surpassing SOTA methods that require multiple samples across several benchmarks.
Statistical Advantage of Softmax Attention: Insights from Single-Location Regression: By proposing the Single-Location Regression (SLR) theoretical framework and employing the order parameter method from statistical physics, this paper rigorously proves in the high-dimensional limit that softmax attention achieves the Bayes risk at the population level while linear attention fundamentally cannot. Under finite-sample regimes, softmax is shown to consistently outperform linear attention. This work provides the first principled explanation for the superiority of softmax attention in retrieval tasks.
Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding: This paper proposes SureLock, which permanently locks token positions in Masked Diffusion LMs once their posterior distributions stabilize after unmasking—skipping Q projection and FFN while caching KV—thereby reducing per-step attention computation from $O(N^2d)$ to $O(MNd)$. SureLock achieves 30–50% FLOPs reduction on LLaDA-8B without degrading generation quality.
The Lattice Representation Hypothesis of Large Language Models: This paper proposes the Lattice Representation Hypothesis (LRH) for LLMs: by unifying the Linear Representation Hypothesis with Formal Concept Analysis (FCA), it demonstrates that attribute directions in LLM embedding spaces implicitly encode a concept lattice via half-space intersections, thereby bridging continuous geometry and symbolic abstraction.
The Path of Least Resistance: Guiding LLM Reasoning Trajectories for Efficient Consistency: This paper proposes PoLR (Path of Least Resistance), the first inference-time method that exploits reasoning prefix consistency. By clustering short prefixes and expanding only the dominant cluster, PoLR serves as an efficient alternative to Self-Consistency, reducing token usage by up to 60% and latency by up to 50%.
Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerabilities: This paper identifies a priming vulnerability in masked diffusion language models (MDLMs)—injecting affirmative tokens at intermediate denoising steps can bypass safety guardrails—and proposes Recovery Alignment (RA), a training method that teaches models to recover safe responses from corrupted intermediate states.
Trapped by simplicity: When Transformers fail to learn from noisy features: This paper demonstrates that Transformers fail to learn Boolean functions from feature-noisy data. Their simplicity bias—a tendency to learn low-sensitivity functions—causes models to become trapped at optimal noisy predictors that are simpler than the target function, preventing recovery of the true noiseless target.
Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions: Three unsupervised metrics are proposed—LLM-guided clustering (goal identification), interaction completeness detection via fine-tuned completion models, and response trees (LLM uncertainty quantification)—for evaluating multi-turn objective-driven dialogues without labeled data or LLM-as-a-judge, achieving performance that matches or exceeds a 70B judge using only an 8B model.
WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality: This work introduces WebDevJudge, a meta-evaluation benchmark that systematically assesses the ability of LLMs/MLLMs and agentic workflows to serve as judges for web development quality. Results reveal an approximately 15% agreement gap between the strongest current models and human experts, and identify two fundamental bottlenecks: failure to recognize functional equivalence and inadequate feasibility verification.
Weight Decay may matter more than μP for Learning Rate Transfer in Practice: Through large-scale empirical analysis, this paper demonstrates that the core alignment assumption of μP holds only briefly at the start of training. In practice, it is independent weight decay rather than μP that correctly stabilizes feature learning dynamics across widths, and the practical benefits of μP can be reinterpreted as a form of implicit learning rate warmup.
When Stability Fails: Hidden Failure Modes of LLMs in Data-Constrained Scientific Decision-Making: Through a controlled behavioral evaluation framework, this paper identifies four hidden failure modes of LLMs in data-constrained scientific decision-making tasks: high stability ≠ correctness, prompt-wording sensitivity, over-selection under relaxed thresholds, and hallucination of invalid identifiers.
When Stability Fails: Hidden Failure Modes of LLMs in Data-Constrained Scientific Decision-Making: This paper reveals hidden failure modes of LLMs in data-constrained scientific decision-making tasks: models can exhibit near-perfect run-to-run stability while systematically diverging from statistical ground truth, manifesting as over-selection, prompt sensitivity, and hallucinated gene identifiers.
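The C3LLM note above (How Catastrophic is Your LLM?) mentions certifying risk with Clopper-Pearson 95% confidence intervals. A minimal sketch of that interval for a binomial proportion follows, using only the Python standard library. The function names and the sample counts are illustrative assumptions, not taken from the paper, and the bisection solver stands in for the usual beta-quantile formula.

```python
from math import comb


def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); increasing in p for k >= 1."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def _solve_increasing(f, target, iters=100):
    """Bisection on [0, 1] for f(p) = target, assuming f is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a proportion with k hits in n trials."""
    # Lower bound: the p at which P(X >= k | p) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: binom_tail_ge(k, n, p), alpha / 2
    )
    # Upper bound: the p at which P(X <= k | p) equals alpha / 2,
    # i.e. P(X >= k + 1 | p) equals 1 - alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: binom_tail_ge(k + 1, n, p), 1 - alpha / 2
    )
    return lower, upper


if __name__ == "__main__":
    # Hypothetical counts: 80 harmful outputs observed in 100 sampled conversations.
    low, high = clopper_pearson(80, 100)
    print(f"95% interval for the harmful-output probability: [{low:.4f}, {high:.4f}]")
```

The lower end of such an interval is the kind of quantity a "risk lower bound" refers to: with 95% confidence, the true probability of a harmful output is at least that value under the sampled conversation distribution.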

📐 Optimization & Theory¶

A Convergence Analysis of Adaptive Optimizers under Floating-Point Quantization: This paper establishes the first theoretical framework for analyzing the convergence of adaptive optimizers under floating-point quantization. By imposing a relative error quantization model simultaneously on gradients, weights, and optimizer states (first and second moments), it proves that quantized Adam and Muon achieve the same $\tilde{O}(T^{-1/4})$ convergence rate as their full-precision counterparts when the mantissa length grows only logarithmically in the number of iterations. The analysis further reveals that Adam is highly sensitive to the quantization of weights and second moments, whereas Muon is theoretically more robust.
Adaptive Rollout Allocation for Online RL with Verifiable Rewards (VIP): This paper proposes VIP (Variance-Informed Predictive allocation), which uses a Gaussian process to predict the success probability of each prompt and then solves a convex optimization problem to allocate rollout counts under a compute budget constraint, minimizing gradient variance. VIP consistently improves the sampling efficiency of GRPO/RLOO on mathematical reasoning tasks, achieving up to 12.3-point gains in Pass@32 on AIME24/25.
Celo2: Towards Learned Optimization Free Lunch: This paper proposes Celo2, a learned optimizer meta-trained in only 4.5 GPU hours, which achieves stable generalization to billion-parameter models (GPT-3 XL, 1.3B), six orders of magnitude beyond the meta-training distribution, via simple recipes including a normalized MLP update rule and task augmentation. Celo2 outperforms the prior VeLO optimizer (which required 4,000 TPU-months of meta-training) and carefully tuned AdamW baselines.
CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving: CogFlow is a cognition-inspired three-stage visual mathematical reasoning framework (Perception → Internalization → Reasoning). It uses Synergistic Visual Rewards to enhance perception, a Knowledge Internalization Reward to bridge perception and reasoning, and Visual-Gated Policy Optimization to anchor visual reasoning, addressing the core problem of "correct perception but drifted reasoning" in existing methods.
COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics: COLD-Steer is a training-free LLM activation steering method that approximates the representational change induced by one step of gradient descent on in-context examples, achieving 95% steering effectiveness with only 1/50 of the samples required by prior methods.
Constraint Matters: Multi-Modal Representation for Reducing Mixed-Integer Linear Programming: This paper proposes a constraint-reduction framework for simplifying MILP models. It defines a fixed-constraint strength $\rho$ and uses information gain $\Delta H = -\log\rho$ to identify critical tight constraints (CTCs). A multi-modal GNN combining an instance-level bipartite graph with an abstract-level type graph is designed to predict CTCs. On four large-scale benchmarks, the method achieves an average improvement of 51.06% in solution quality ($\text{gap}_\text{abs}$) and an average speedup of 17.47% in convergence (PDI).
Converge Faster, Talk Less: Hessian-Informed Federated Zeroth-Order Optimization: This paper proposes HiSo (Hessian-informed Scalar-only communication), which leverages a global diagonal Hessian approximation to accelerate convergence in federated zeroth-order optimization while strictly maintaining scalar communication without transmitting any second-order information. Theoretical analysis proves that, under low effective rank and whitening assumptions, the convergence rate is independent of the Lipschitz constant $L$ and model dimension $d$. Experiments on OPT-350M/1.3B/2.7B fine-tuning demonstrate a 1.4–5.4× reduction in communication rounds, with total communication cost remaining at the KB level.
Convergence of Muon with Newton-Schulz: This work provides the first convergence guarantees for the Muon optimizer as it is actually used in practice—with Newton-Schulz (NS) approximation rather than exact SVD-based polar decomposition. It proves that the convergence rate matches the idealized SVD variant up to a constant factor $C_q$ that decays doubly exponentially in the number of NS iterations $q$, and that Muon enjoys a $\sqrt{r}$ advantage over its vector-space counterpart SGD-M due to reduced rank loss.
Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate: Grounded in convex optimization theory, this paper proves that training loss in deep learning converges at a rate of $O(1/\sqrt{T})$ and that the optimal learning rate scales as $1/\sqrt{T}$. The resulting scaling law is validated across models ranging from GPT-2 to 12.5B parameters ($R^2 \geq 0.978$), enabling learning rate extrapolation across an 80× range of training steps.
DeepAFL: Deep Analytic Federated Learning: This paper proposes DeepAFL, which designs gradient-free analytic residual blocks and introduces a layer-wise federated training protocol, achieving for the first time a deep analytic federated learning model with representation learning capability. The method maintains ideal invariance to data heterogeneity while overcoming the fundamental limitation of existing analytic approaches to single-layer linear models, surpassing the state of the art by 5.68%–8.42% across three benchmark datasets.
Directional Convergence, Benign Overfitting of Gradient Descent in leaky ReLU two-layer Neural Networks: This paper provides the first proof of directional convergence of gradient descent in leaky ReLU two-layer neural networks, and on this basis establishes sufficient conditions for benign overfitting under a significantly broader mixed-data setting that goes well beyond nearly orthogonal data, while uncovering a novel phase transition phenomenon.
Dual Optimistic Ascent (PI Control) is the Augmented Lagrangian Method in Disguise: This paper proves that dual optimistic ascent (PI control), widely used in constrained deep learning, is mathematically equivalent to the classical Augmented Lagrangian Method (ALM) under the single-step first-order update regime. This equivalence transfers ALM's robust convergence guarantees (linear convergence to all strict local solutions) to PI control, and provides principled tuning guidance for the optimism coefficient $\omega$.
Exploring Diverse Generation Paths via Inference-time Stiefel Activation Steering: This paper proposes STARS (Stiefel-based Activation Steering for Diverse ReaSoning), a training-free inference-time activation steering method that jointly optimizes $N$ orthogonal steering directions on the Stiefel manifold at each token decoding step, maximizing the geometric volume of hidden states to promote divergent activation trajectories. STARS consistently outperforms temperature sampling in diversity on test case generation (TestEval) and scientific discovery (LiveIdeaBench) with negligible latency overhead and without sacrificing output quality.
Faster Gradient Methods for Highly-Smooth Stochastic Bilevel Optimization: By reinterpreting the F2SA method as a forward-difference approximation to the hyper-gradient, this paper proposes the F2SA-p algorithm family based on higher-order finite differences. Under higher-order smoothness conditions, the SFO complexity for stochastic bilevel optimization is improved from $\tilde{\mathcal{O}}(\epsilon^{-6})$ to $\tilde{\mathcal{O}}(p\epsilon^{-4-2/p})$, and a matching lower bound of $\Omega(\epsilon^{-4})$ is established, showing near-optimality for sufficiently large $p$.
FedDAG: Clustered Federated Learning via Global Data and Gradient Integration for Heterogeneous Environments: FedDAG is a clustered federated learning framework that performs more accurate client clustering via weighted class-wise similarity fusion of data and gradient signals, and enables cross-cluster feature transfer through a dual-encoder architecture, consistently outperforming existing baselines across diverse heterogeneity settings.
FrontierCO: Real-World and Large-Scale Evaluation of Machine Learning Solvers for Combinatorial Optimization: FrontierCO is a large-scale, real-world benchmark covering 8 categories of combinatorial optimization problems (TSP, MIS, CVRP, etc.), evaluating 16 ML solvers (neural methods + LLM agents) against state-of-the-art classical solvers. The benchmark reveals that ML methods remain significantly behind classical approaches on structurally complex and extremely large-scale instances, though they show potential to surpass classical methods in certain scenarios.
Generalization Below the Edge of Stability: The Role of Data Geometry: This paper introduces the principle of data shatterability to provide a unified explanation of how data geometry governs the strength of implicit regularization induced by gradient descent near the Edge of Stability (EoS). For the Beta(α) radial distribution family, the authors derive a spectrum of generalization upper and lower bounds that depend on α. For mixture distributions supported on low-dimensional subspaces, they prove that the generalization rate adapts to the intrinsic dimension $m$ rather than the ambient dimension $d$.
Learning to Recall with Transformers Beyond Orthogonal Embeddings: This paper analyzes the "early phase" of empirical gradient descent for a single-layer Transformer on a token retrieval task under random (non-orthogonal) embeddings. It derives an explicit formula for storage capacity, revealing a multiplicative dependence among sample size $N$, embedding dimension $d$, and sequence length $L$, and proves that this scaling relation is intrinsic to the information-theoretic lower bound.
Learning to Solve Orienteering Problem with Time Windows and Variable Profits: This paper proposes DeCoST, a learning-based two-stage framework that decouples the coupled discrete routing decisions and continuous service time allocation in OPTWVP. The first stage employs a parallel decoder to jointly generate routes and initial service times, while the second stage applies LP to optimally allocate service times (globally optimal). Cross-stage coordination is achieved via pTAR feedback. DeCoST achieves optimality gaps of only 0.83%–3.31% on OPTWVP instances with 50–500 nodes, with inference up to 45× faster than metaheuristics.
Markovian Transformers for Informative Language Modeling: This paper proposes the Markovian Language Model (MLM) framework, which uses a structural constraint to make the CoT a causally necessary reasoning bottleneck: the original question is removed during answer prediction, so the answer is derived solely from the CoT, much like the narrow latent layer in an autoencoder. Combined with GRPO-style policy gradient training, this improves accuracy on GSM8K from 19.6% to 57.1%. The learned CoT also transfers across model architectures (Llama→Mistral/Phi/GPT-2), demonstrating that CoT encodes natural language reasoning rather than steganography.
Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization: This paper provides a rigorous theoretical analysis of the implicit bias of SAM when training linear diagonal networks, revealing a qualitative phase transition induced by increasing depth from $L=1$ to $L=2$: the limiting direction of $\ell_\infty$-SAM is highly sensitive to initialization, while $\ell_2$-SAM exhibits a sequential feature amplification phenomenon — "minor first, major last" — demonstrating that analyses focused solely on the $t\to\infty$ limit are insufficient to characterize the full dynamics of SAM.
MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates: This paper proposes MT-DAO, a multi-timescale distributed adaptive optimizer that introduces a slow momentum (high $\beta$) to address the timescale mismatch caused by excessively rapid decay of standard momentum under infrequent communication. MT-DAO provides the first convergence guarantee in this setting, eliminates the performance gap with fully synchronized DDP in language model pre-training, and reduces end-to-end training time by 6–27%.
∇-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space: This paper proposes ∇-Reasoner, which upgrades inference-time search from zeroth-order (sampling + evaluation) to first-order (gradient descent). By applying Differentiable Textual Optimization (DTO) in the token logits space—jointly leveraging reward gradients and LLM likelihood—the framework iteratively refines the decoding strategy, achieving 10–40% accuracy gains on mathematical reasoning tasks while reducing model calls by 10–40%.
Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit: This paper proves that, under generic non-degeneracy assumptions, standard two-layer neural networks trained via hierarchical gradient descent can learn generic Gaussian Multi-Index models $f(\bm{x})=g(\bm{U}\bm{x})$ with $\tilde{O}(d)$ samples and $\tilde{O}(d^2)$ time, achieving information-theoretically optimal sample and time complexity. This constitutes the first proof that neural networks can efficiently learn hierarchical functions.
Non-Asymptotic Analysis of Efficiency in Conformalized Regression: This work establishes the first non-asymptotic efficiency bounds for Conformalized Quantile Regression (CQR) and Conformalized Median Regression (CMR) trained with SGD, explicitly characterizing the joint dependence of prediction set length bias on training sample size $n$, calibration sample size $m$, and miscoverage level $\alpha$.
Nonparametric Teaching of Attention Learners: This paper proposes AtteNT, a reinterpretation of attention learner (Transformer/ViT) training through the lens of nonparametric teaching theory. It analyzes the importance-adaptive role of attention in parametric gradients, proves that the dynamic ANTK converges to an importance-adaptive canonical kernel in functional gradient descent (bridging parameter space and function space), and applies a greedy teaching algorithm that selects the samples with the largest prediction deviation to accelerate training. This yields time savings of 13.01% for LLM fine-tuning and 20.58% for ViT training from scratch, with no degradation in accuracy.
NRGPT: An Energy-based Alternative for GPT: This paper proposes NRGPT (eNeRgy-GPT), which applies minimal modifications to standard GPT to yield an energy-based model: attention and feedforward energy functions are designed such that each forward pass is equivalent to a gradient descent step on the energy landscape. The work proves asymptotic energy decrease and stable convergence properties, and validates performance comparable to standard GPT on ListOps, Shakespeare, and OpenWebText.
Optimizer Choice Matters for the Emergence of Neural Collapse: Through 3,900+ training experiments and theoretical analysis, this paper reveals that optimizer choice—particularly the coupling mechanism of weight decay—plays a decisive role in the emergence of Neural Collapse: AdamW (decoupled weight decay) fails to produce Neural Collapse, whereas SGD and Adam (coupled weight decay) succeed.
Personalized Collaborative Learning with Affinity-Based Variance Reduction: This paper proposes AffPCL, a personalized collaborative learning framework that enables heterogeneous agents to collaboratively learn personalized solutions without prior knowledge, via bias correction and importance correction mechanisms. It achieves an adaptive convergence rate of $O(t^{-1} \cdot \max\{n^{-1}, \delta\})$—yielding linear speedup when agents are similar, and no worse than independent learning when they are dissimilar.
Πnet: Optimizing Hard-Constrained Neural Networks with Orthogonal Projection Layers: This paper proposes the Πnet architecture, which appends a Douglas-Rachford operator splitting-based orthogonal projection layer to the output of a neural network to guarantee strict satisfaction of convex constraints, and employs the implicit function theorem for efficient backpropagation. Πnet substantially outperforms existing methods in training time, solution quality, and hyperparameter robustness.
Provable and Practical In-Context Policy Optimization for Self-Improvement: This paper proposes the In-Context Policy Optimization (ICPO) framework, theoretically proving that a single-layer linear self-attention Transformer, after sufficient pretraining, can simulate a policy optimization algorithm in context. Building on this, the paper designs a practical ME-ICPO algorithm that achieves multi-round test-time self-reflection via minimum-entropy selection and self-evaluation rewards, yielding significant gains on mathematical reasoning tasks (Qwen2.5-Math-7B improves from 11% to 30% on AIME 2024).
Rapid Training of Hamiltonian Graph Networks using Random Features: This paper proposes RF-HGN, which constructs dense layer parameters via random feature sampling (ELM/SWIM) and solves a linear least-squares problem to train Hamiltonian Graph Networks. By completely bypassing gradient-descent-based iterative optimization, RF-HGN achieves 150–600× speedup on N-body physical systems while maintaining comparable accuracy and strong zero-shot generalization.
Rethinking Consistent Multi-Label Classification Under Inexact Supervision: The paper proposes the COMES framework, which provides consistent risk estimators for multi-label classification under inexact supervision via first-order (Hamming loss) and second-order (Ranking loss) strategies, without requiring estimation of the label generation process or uniform distribution assumptions.
Rolling Ball Optimizer: Learning by Ironing Out Loss Landscape Wrinkles: This paper proposes the Rolling Ball Optimizer (RBO), which breaks the spatial locality of conventional optimizers by simulating the rolling motion of a finite-radius rigid sphere over the loss landscape. The rolling motion induces an ironing property on the loss function, and RBO demonstrates superior convergence speed and generalization performance on MNIST and CIFAR-10/100.
RRNCO: Towards Real-World Routing with Neural Combinatorial Optimization: This paper proposes the RRNCO architecture, which introduces two key innovations — Adaptive Node Embedding (ANE) and Neural Adaptive Bias (NAB) — to jointly model asymmetric distances, travel durations, and bearing angles within a deep routing framework for the first time. It also constructs a VRP benchmark dataset based on 100 real-world cities, significantly narrowing the sim-to-real gap for NCO methods.
RS-ORT: A Reduced-Space Branch-and-Bound Algorithm for Optimal Regression Trees: This paper proposes RS-ORT, an algorithm that reformulates regression tree training as a two-stage optimization problem and applies branch-and-bound over a reduced space (branching only on tree structure variables). Combined with closed-form leaf predictions, threshold discretization, and exact last-layer subtree resolution, RS-ORT is the first exact method to achieve globally optimal regression tree learning on datasets with up to 2 million samples containing continuous features.
Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures: This paper proposes a unified theoretical framework that explains the pervasive simplicity bias observed across multiple neural network architectures (fully connected, convolutional, and attention-based) through saddle-to-saddle learning dynamics — the phenomenon whereby gradient descent tends to learn simple solutions first before progressively learning more complex ones.
Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning: This paper proposes the Scaf-GRPO framework, which injects hierarchical in-prompt hints (Knowledge → Planning → Solution) to overcome the "learning cliff" (zero-reward) problem in GRPO training. On Qwen2.5-Math-7B, it achieves a 44.3% relative improvement in pass@1 on AIME24 while preserving on-policy training consistency.
Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD?: Under the Power-Law Random Features model, this paper systematically analyzes the scaling laws of SignSGD, reveals two distinctive effects of SignSGD relative to SGD—drift normalization and noise reshaping—and demonstrates that the compute-optimal scaling exponent of SignSGD can surpass that of SGD in noise-dominated regimes.
SCRAPL: Scattering Transform with Random Paths for Machine Learning: To address the prohibitive computational cost of using the multivariate scattering transform (ST) as a differentiable loss function due to its large number of paths $P$, this paper proposes SCRAPL—a framework that samples a single random path per training step and stabilizes gradient updates via three variance-reduction techniques: P-Adam (path-adaptive momentum), P-SAGA (path stochastic average gradient), and $\theta$-importance sampling. On unsupervised sound matching tasks, SCRAPL achieves Pareto optimality by attaining accuracy close to the full-path ST at computational costs comparable to multi-scale spectral (MSS) loss.
Test-Time Meta-Adaptation with Self-Synthesis: This paper proposes MASS (Meta-Adaptation with Self-Synthesis), a framework that employs bilevel optimization-based meta-learning to enable LLMs to generate task-specific synthetic training data at inference time via a Generator, filter samples through a Scorer, and perform weighted SFT self-update via LoRA. Meta-gradients are backpropagated through the inner update to optimize data quality, improving Llama-3.1-8B from 43.6% to 59.0% on MATH-500.
The Affine Divergence: Aligning Activation Updates Beyond Normalisation: This paper reveals a fundamental misalignment between the steepest descent direction in parameter space and the effective update propagated to activations under gradient descent — the "affine divergence" $\Delta\mathcal{L}/\Delta z_i = (\partial\mathcal{L}/\partial z_i) \cdot (\|\vec{x}\|^2+1)$ — derives normalization as the natural remedy from first principles, and discovers a non-normalizing alternative that empirically surpasses conventional normalization methods.
Unifying Formal Explanations: A Complexity-Theoretic Perspective: This paper proposes a unified framework that reduces sufficient reasons and contrastive reasons (local/global, probabilistic/non-probabilistic) to the problem of minimizing a unified probabilistic value function. It reveals that global value functions possess key combinatorial optimization properties—monotonicity, submodularity/supermodularity—and proves that global explanations are computable in polynomial time, even when their local counterparts are NP-hard.
Weak-SIGReg: Covariance Regularization for Stable Deep Learning: This work transfers SIGReg regularization from LeJEPA's self-supervised learning setting to supervised learning and proposes a computationally efficient variant called Weak-SIGReg—constraining the covariance matrix toward the identity (rather than matching all moments). Random projections reduce memory from $O(C^2)$ to $O(CK)$. On a ViT without BN or residual connections, this approach recovers CIFAR-100 accuracy from 20.73% (collapsed) to 72.02%, matching or surpassing carefully tuned baselines.
When to Restart? Exploring Escalating Restarts on Convergence: This paper proposes SGD-ER (SGD with Escalating Restarts), a convergence-aware learning rate scheduling strategy that triggers restarts with linearly escalating learning rates upon detecting training stagnation, enabling the optimizer to escape sharp local minima and explore flatter loss landscape regions. SGD-ER achieves 0.5–4.5% test accuracy improvements on CIFAR-10/100 and TinyImageNet.
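As a small illustration for the affine-divergence entry above, the sketch below checks numerically where a factor of $\|\vec{x}\|^2+1$ can come from. It assumes a single affine layer $z = Wx + b$, one plain gradient-descent step with learning rate `eta`, and the same input `x` used for both the gradient and the re-evaluation; the helper names and the numeric values are hypothetical and are not taken from the paper.

```python
def affine_forward(W, b, x):
    # z_i = sum_j W_ij * x_j + b_i
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def gd_step(W, b, x, g, eta):
    # For z = Wx + b: dL/dW_ij = g_i * x_j and dL/db_i = g_i, where g = dL/dz.
    W_new = [[w - eta * gi * xj for w, xj in zip(row, x)] for row, gi in zip(W, g)]
    b_new = [bi - eta * gi for bi, gi in zip(b, g)]
    return W_new, b_new


# Hypothetical toy values, chosen only for illustration.
x = [1.0, 2.0, -1.0]
W = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
b = [0.1, -0.3]
g = [0.7, -1.5]  # assumed upstream gradient dL/dz
eta = 0.01

z_old = affine_forward(W, b, x)
W_new, b_new = gd_step(W, b, x, g, eta)
z_new = affine_forward(W_new, b_new, x)

# Actual change in the activations after the parameter update.
delta_z = [zn - zo for zn, zo in zip(z_new, z_old)]

# Change implied by steepest descent on z, rescaled by (||x||^2 + 1).
scale = sum(xj * xj for xj in x) + 1.0
predicted = [-eta * gi * scale for gi in g]

print(delta_z, predicted)
```

Under these assumptions the activation update equals $-\eta\,(\partial\mathcal{L}/\partial z_i)\,(\|\vec{x}\|^2+1)$, so its size depends on the input norm, which is consistent with the entry's point that normalization is a natural remedy.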

⚖️ Alignment & RLHF¶

A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models: This paper proposes A2D, a token-level safety alignment method for diffusion language models (dLLMs) that trains the model to output [EOS] tokens at masked positions containing harmful content, enabling robust defense across any decoding order and any decoding step. It reduces DIJA template attack success rates from 80%+ to near zero (1.3%/0.0%) while supporting early rejection for a 19.3× speedup.
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment: This paper proposes Multi-Lingual Consistency (MLC), an auxiliary loss that manipulates the singular values of a multilingual representation matrix via SVD to drive it toward rank-1 (i.e., collinear multilingual representations). Using only multilingual prompt translations—without requiring target-language responses—MLC consistently transfers safety alignment from one language to all others.
Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization: This paper proposes MetaAPO, a framework that employs a lightweight meta-learner (a two-layer MLP) to dynamically estimate the alignment gap between offline and online data. The meta-learner simultaneously guides where to perform online sampling (addressing distribution mismatch) and adaptively reweights offline/online data during training (improving learning efficiency). MetaAPO outperforms DPO, Online DPO, and other baselines on AlpacaEval 2, Arena-Hard, and MT-Bench, while reducing online annotation costs by 42%.
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint: This paper proposes AlphaSteer, which learns a null-space-constrained transformation matrix to dynamically construct steering vectors that produce near-zero vectors for benign inputs (preserving utility) while reconstructing the refusal direction vector for malicious inputs (enhancing safety), providing theoretical guarantees for the decoupling of safety and utility.
Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence: This paper proposes Antibody, a two-stage defense framework that (1) during alignment, applies flatness regularization to place the model in a flat region of the harmful loss landscape (small gradients → harder to attack), and (2) during fine-tuning, suppresses learning from harmful samples via a likelihood-ratio-based sample weighting scheme (contrasting the likelihood of task completion vs. refusal). The average Harmful Score is reduced from 15.29% to 7.04%.
AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization: To address spurious associations and hallucinations in multimodal large language models (MLLMs) for emotion reasoning, this work proposes the EmoReAlM evaluation benchmark and the AVEm-DPO preference optimization method. By constructing targeted preference pairs and incorporating text-prior regularization, the approach achieves 6–19% relative zero-shot performance gains on DFEW, RAVDESS, and EMER.
Beyond Pairwise: Empowering LLM Alignment With Ranked Choice Modeling: This paper proposes RCPO, a framework that extends LLM alignment from pairwise preference to ranked choice modeling. By unifying a utility model (MNL) and a ranking model (Mallows-RMJ) under MLE, RCPO outperforms DPO and its variants under both single-best and top-k feedback formats.
Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework: This paper proposes a preference learning framework grounded in social choice theory axioms. It infers the feasible set of evaluator population distributions from pairwise comparison data and constructs policies satisfying two axioms: Population-Proportional Alignment (PPA) and Population-Bounded Manipulability (PBM).
CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation: This paper proposes the CAGE framework, which decouples the adversarial structure of red-teaming prompts from their cultural content via a construct termed the Semantic Mold. CAGE systematically adapts English red-teaming benchmarks to diverse cultural contexts, yielding culturally grounded prompts that achieve substantially higher attack success rates (ASR) than direct translation.
Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training: This paper theoretically establishes that reward over-optimization stems primarily from misspecification in the high-reward tail region, and proposes a rubric-based reward modeling approach: leveraging off-policy data (high-quality responses from stronger models) to construct scoring rubrics, which are progressively refined by distinguishing "good vs. better" responses, effectively mitigating reward over-optimization.
Displacement-Resistant Extensions of DPO with Nonconvex $f$-Divergences: This paper establishes that the solvability of f-DPO does not require convexity of $f$ — only $\lim_{t\to 0^+} f'(t) = -\infty$ is needed — and further proves that $\arg\min f(t) \geq 1$ is a necessary condition for displacement resistance. Based on these findings, the paper proposes SquaredPO ($f(t) = \frac{1}{2}(\log t)^2$, nonconvex), which significantly alleviates the winner probability degradation problem while maintaining competitive performance.
Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation: This paper proposes the Dual-IPO framework, which performs multi-round bidirectional iterative optimization between a reward model and a video generation model. Without large-scale human annotation, the approach continuously improves text-to-video generation quality and human preference alignment, enabling a 2B model to surpass a 5B model.
From Utterance to Vividity: Training Expressive Subtitle Translation LLM via Adaptive Local Preference Optimization: This paper proposes ALPO (Adaptive Local Preference Optimization) for training expressive subtitle translation LLMs. Three empirical findings motivate the design: (1) subtitle translation exhibits the lowest back-translation consistency, indicating the highest degree of paraphrase; (2) reasoning-type LLMs (R1/GPT-5 Thinking) produce more expressive paraphrases than chat-type LLMs (GPT-4o/Qwen-Max); (3) a 14B model used as a translation evaluator achieves Spearman correlation $\geq 0.82$ with human judgments, qualifying it as a low-cost reward model. Building on these findings, the paper proposes a fine-grained, process-supervised preference alignment method operating at the sentence-segment level (with adaptive weighting, dynamic beta, and prefix mixing). A 14B model trained with ALPO surpasses GPT-4o and DeepSeek-R1 in vividness across multiple subtitle translation directions.
General Exploratory Bonus for Optimistic Exploration in RLHF: This paper theoretically demonstrates that existing RLHF exploratory bonuses under KL and α-divergence regularization actually drive the policy toward high-probability regions of the reference model—contrary to the principle of optimism. It proposes the General Exploratory Bonus (GEB) framework, which introduces reference-model-dependent reward modulation to counteract the conservative bias induced by divergence regularization, and provably satisfies the optimism principle.
Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends: By constructing a KL-regularized surrogate objective and deriving a pairwise consistency condition from first principles, this paper proves that group-relative REINFORCE (GRPO) is inherently an off-policy algorithm. Component isolation experiments further reveal that clipping is the sole driver of training stability while importance sampling can be entirely removed. Within this unified framework, the paper reinterprets several seemingly independent algorithms—including Kimi OPMD and Meta AsymRE—under a common theoretical lens.
GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models: This paper proposes GuardAlign, a training-free test-time safety defense framework for multimodal large language models. It leverages optimal transport (OT) to precisely detect and mask unsafe regions in images, and employs cross-modal attention calibration to sustain the influence of safety prefixes across layers. Evaluated on six LVLMs, GuardAlign reduces unsafe response rates by up to 39% while preserving or improving general capability.
Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks: This paper identifies a "historical context inconsistency" problem in stepwise group-based RL methods (e.g., GRPO/GiGPO): steps within the same group may have different historical contexts, leading to biased advantage estimation. The paper proposes HGPO, which achieves low-bias, balanced-variance advantage estimation through hierarchical grouping and adaptive weighting, yielding significant improvements on ALFWorld and WebShop with negligible additional overhead (<0.001%).
Is On-Policy Data always the Best Choice for Direct Preference Optimization-based LM Alignment?: This paper challenges the prevailing assumption that on-policy data is always superior, revealing that LLM alignment comprises two distinct stages — preference injection (requiring high-diversity off-policy data) and preference fine-tuning (requiring high-quality on-policy data) — with the optimal data type varying across models and stages. A boundary detection algorithm incurring only 3.2% additional computational overhead is proposed and validated across 5 models × 55 configurations.
JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks: This paper introduces JailNewsBench, the first multilingual and multi-regional benchmark for evaluating LLM robustness against fake news generation under jailbreak attacks. Covering 34 regions, 22 languages, and approximately 300,000 instances, the benchmark reveals attack success rates as high as 86.3% and exposes a systematic safety imbalance in which English- and U.S.-topic defenses are significantly weaker than those for other regions.
Learning More with Less: A Dynamic Dual-Level Down-Sampling Framework for Efficient Policy Optimization: This paper proposes D3S (Dynamic Dual-Level Down-Sampling), a framework that maximizes advantage variance at the sample level and prioritizes high-entropy, high-advantage tokens at the token level, combined with a dynamic scheduling strategy. D3S achieves faster convergence and superior performance using fewer than 20% of tokens.
Learning Ordinal Probabilistic Reward from Preferences (OPRM): This paper proposes the Ordinal Probabilistic Reward Model (OPRM), which discretizes response quality into ordinal grades from 1 to 9 and learns the full probability distribution over these grades. Combined with Region Flooding Tuning (RgFT), it enables data-efficient training. OPRM achieves 89.3% on RewardBench, outperforming existing reward models by 2.9%–7.4%, while also providing uncertainty estimation and annotation disagreement detection.
Mitigating Mismatch within Reference-based Preference Optimization: This paper identifies the premature satisfaction problem in DPO — when the reference policy assigns lower probability to chosen than to rejected responses (~45% of pairs), DPO's gradient is unnecessarily attenuated by the pessimistic reference signal, even when the policy is still incorrect (i.e., $\Delta_\theta < 0$). The paper proposes HyPO (a one-line code change: clipping the reference margin via $\max(0, \Delta_{ref})$), achieving a 41.2% relative improvement over DPO on AlpacaEval 2.0.
Mitigating the Safety Alignment Tax with Null-Space Constrained Policy Optimization: This paper proposes NSPO, which projects safety alignment policy gradients onto the null space of general-task representations, geometrically ensuring that safety optimization does not degrade general capabilities. Using only 40% of the safety training data, NSPO achieves state-of-the-art results across 7 safety benchmarks while incurring virtually no performance loss on mathematics, code generation, and instruction following.
No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping: This paper identifies that a large proportion of "zero-variance prompts" (where all sampled responses are either entirely correct or entirely incorrect) are silently discarded during GRPO training. The proposed RL-ZVP algorithm extracts learning signals from these prompts via entropy-guided advantage shaping, achieving improvements of up to 8.61 accuracy points and 7.77 pass-rate points over GRPO across six mathematical reasoning benchmarks.
Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search: This paper proposes CC-BOS, a framework that exploits the semantic compression and inherent ambiguity of Classical Chinese, combined with a Fruit Fly Optimization Algorithm to search an eight-dimensional strategy space for optimal jailbreak prompts, achieving nearly 100% attack success rate across six mainstream LLMs.
Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks: This paper systematically investigates how sparsity in MoE language models differentially affects memorization and reasoning tasks: memorization tasks favor higher sparsity (more parameters), while reasoning performance peaks near TPP (tokens per parameter) ≈ 20. This trend remains consistent after GRPO post-training and with increased test-time compute.
Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check: This paper proposes an Answer-Then-Check strategy: the model first generates an intended answer summary in its chain-of-thought, then conducts safety analysis against a safety policy, and finally decides whether to output or refuse. After training on the constructed 80K ReSA dataset, the method achieves a 99.3% defense rate against 7 jailbreak attacks (RL variant), with only 500 samples needed to match full-dataset performance.
PURGE: Reinforcement Unlearning via Group Relative Policy Optimization: PURGE reformulates LLM unlearning as a verifiable RL task, employing the GRPO framework with intrinsic reward signals (penalizing mentions of forbidden concepts) to achieve safe and consistent knowledge removal. It consumes 46× fewer tokens than the prior state of the art while improving fluency by 5.48% and adversarial robustness by 12.02%.
SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety: This work revisits the safety-constrained RLHF objective, proves the existence of a closed-form optimal policy, and derives an equivalent tractable objective, SafeDPO. The method requires only a safety-aware data transformation and a safety margin term (one additional hyperparameter) on top of standard DPO, without reward or cost models. It achieves a 96.87% harmlessness rate on PKU-SafeRLHF-30K while maintaining competitive helpfulness, and trains 25× faster than SafeRLHF.
Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study: Through four systematic experiments (parallel projection, orthogonal projection, subspace overlap, and activation space analysis) conducted across five open-source LLMs, this paper establishes a key finding: safety alignment behavior is highly entangled with general learning in both weight space and activation space, and no linearly separable independent safety subspace exists. Consequently, defense strategies based on subspace projection/filtering face fundamental limitations.
SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks: This paper proposes SEMA, a two-stage training framework consisting of prefilling self-tuning and RL with an intent-drift-aware reward. Without relying on any existing attack strategies or external data, SEMA trains an attacker capable of automatically generating multi-turn jailbreak attacks, achieving an average ASR@1 of 80.1% across three victim models on AdvBench — surpassing the prior state of the art by 33.9%.
Semantic-aware Wasserstein Policy Regularization for Large Language Model Alignment: This paper identifies a fundamental limitation of standard KL divergence regularization in RLHF: it compares token probabilities only at identical index positions, completely ignoring semantic similarity. The authors propose Wasserstein Policy Regularization (WPR), a semantic-aware policy regularization based on entropy-regularized Wasserstein distance. Through a dual formulation, WPR converts the regularization into token-level penalty terms compatible with standard RL algorithms such as PPO, and consistently outperforms KL divergence and various f-divergence baselines on dialogue generation and summarization tasks.
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy: This paper proposes a two-stage preference data curation pipeline based on Human-AI synergy. Stage 1 accumulates approximately 1M preference pairs over 8 iterative rounds via human verification, error-driven adaptive retrieval, and preference-guided LLM annotation. Stage 2 scales the dataset to 26M pairs using dual-RM consistency filtering. The resulting Skywork-Reward-V2 8B model achieves 97.8% on RewardBench and an average of 88.6% across 7 mainstream benchmarks, comprehensively surpassing all open-source 70B reward models.
Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning: This paper proposes SFPO (Slow-Fast Policy Optimization), which decomposes each training step into a three-stage structure of "fast trajectory — reposition — slow correction." Without modifying the objective function or rollout procedure, SFPO serves as a plug-and-play enhancement to GRPO, achieving up to 2.80-point average improvement on mathematical reasoning benchmarks and up to 4.93× reduction in rollouts.
Superficial Safety Alignment Hypothesis: This paper proposes the Superficial Safety Alignment Hypothesis (SSAH): safety alignment is essentially teaching a model to perform an implicit binary classification task (execute vs. refuse), requiring only ~1.3% of neurons to establish safety guardrails. Freezing these safety-critical units during fine-tuning preserves safety, and leveraging redundant units as an "alignment budget" eliminates the alignment tax.
Swap-guided Preference Learning for Personalized RLHF (SPL): This paper addresses posterior collapse in Variational Preference Learning (VPL) by proposing SPL, which introduces swap-guided base regularization (forcing latent variables to encode user preferences rather than being ignored), a Preferential-IAF decomposition of swap-reversible and swap-invariant signals, and adaptive latent variable modulation. On Llama-3.1-8B, SPL achieves 63.71% accuracy and 97.10% active units, whereas VPL collapses to 57.14% accuracy and 0% active units.
Token-Importance Guided Direct Preference Optimization (TI-DPO): This paper proposes TI-DPO, which precisely quantifies each token's contribution to preference via a hybrid weighting mechanism combining gradient attribution and a Gaussian prior, and incorporates a triplet loss to guide optimization in a continuous semantic space. The method achieves state-of-the-art performance with an average score of 62.3 across 6 benchmarks, while providing interpretable token-level control.
Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models (UltraBreak): This paper proposes UltraBreak, which combines a semantic adversarial objective (replacing cross-entropy with cosine similarity to produce a smooth loss landscape) and input-space constraints (random transformations + TV regularization to yield transformation-invariant features) to optimize a single universal adversarial image capable of jailbreaking 6+ VLM architectures and commercial models. The average black-box ASR reaches 71% on SafeBench, substantially outperforming prior methods.
Towards Understanding Valuable Preference Data for Large Language Model Alignment: This work studies preference data quality from a model-dependent perspective. It proposes Truncated Influence Functions (TIF), revealing that data with medium IF values—rather than high IF values as conventionally assumed—is most valuable. Two lightweight proxy metrics, LossDiff and IRM, are designed to approximate TIF. The combined LossDiff-IRM selector achieves an average WinRate improvement of 13.58% using only 50–64% of the data, with consistent effectiveness across multiple LLM families and alignment benchmarks.
Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs: This paper proposes Uni-DPO, which unifies dynamic reweighting of preference pairs via three components: quality-aware weighting (prioritizing pairs with large score margins), performance-aware weighting (a focal loss focusing on underfitted samples), and a calibrated NLL loss. Uni-DPO consistently outperforms DPO/SimPO on text understanding and mathematical reasoning benchmarks, with Gemma-2-9B achieving 67.1% on Arena-Hard, surpassing Claude 3 Opus (60.4%).
Unifying Stable Optimization and Reference Regularization in RLHF (DAR): This paper proposes DAR (Dual-regularized Advantage Regression), identifying that reference-model regularization (for preventing reward hacking) and policy stability constraints (for preventing collapse) in standard RLHF progressively conflict, excessively restricting the optimization space. DAR addresses this via a dual-KL objective that interpolates reference policies in log-space and applies a regression transformation to eliminate policy-ratio instability, achieving an average win rate of 92.42% in direct AI alignment and standard RLHF settings, surpassing GRPO by 7.27%.
Why DPO is a Misspecified Estimator and How to Fix It: This paper proves from an information-geometric perspective that DPO is fundamentally a misspecified statistical estimator under parameterized (non-tabular) policy classes—DPO projects the true reward function onto the implicit reward manifold via KL projection, leading to preference reversal and reward degradation when the reward is unrealizable—and proposes AuxDPO, which introduces null-space auxiliary variables to remedy this misspecification.
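To make the HyPO entry above (Mitigating Mismatch within Reference-based Preference Optimization) concrete, here is a minimal sketch of the described one-line change, written against the standard per-pair DPO loss $-\log\sigma(\beta(\Delta_\theta - \Delta_{ref}))$. It is a reading of the summary rather than the authors' code: the function names, the scalar per-pair interface, and the example margins are assumptions made for illustration.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dpo_margin(delta_theta: float, delta_ref: float) -> float:
    # delta_* = log p(chosen) - log p(rejected) under the policy / reference.
    return delta_theta - delta_ref


def hypo_margin(delta_theta: float, delta_ref: float) -> float:
    # The one-line change described above: clip the reference margin at zero.
    return delta_theta - max(0.0, delta_ref)


def loss(margin: float, beta: float = 0.1) -> float:
    # -log(sigmoid(beta * margin)) == softplus(-beta * margin)
    return softplus(-beta * margin)


def grad_weight(margin: float, beta: float = 0.1) -> float:
    # Magnitude of d(loss)/d(delta_theta) = beta * sigmoid(-beta * margin).
    return beta * sigmoid(-beta * margin)


# Hypothetical pair: pessimistic reference (delta_ref < 0), policy still wrong.
d_theta, d_ref, beta = -1.0, -5.0, 1.0
w_dpo = grad_weight(dpo_margin(d_theta, d_ref), beta)
w_hypo = grad_weight(hypo_margin(d_theta, d_ref), beta)
print(w_dpo, w_hypo)
```

In this toy case the DPO gradient weight is much smaller than the clipped one even though $\Delta_\theta < 0$, which is the attenuation the entry describes. When $\Delta_{ref} \geq 0$ the two margins coincide, so the sketch leaves those pairs unchanged.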

LLM Safety¶

Attention Smoothing Is All You Need For Unlearning: This paper proposes Attention Smoothing Unlearning (ASU), which constructs a forget-teacher by raising the softmax temperature in self-attention, reformulating the unlearning problem as self-distillation. By smoothing the attention distribution to weaken both lexical- and semantic-level associations, ASU erases memorized knowledge while preserving output coherence, surpassing existing unlearning methods on multiple benchmarks including TOFU, MUSE, and WMDP.
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models: This paper proposes AudioTrust, the first multidimensional trustworthiness evaluation benchmark for audio large language models (ALLMs), encompassing six dimensions—fairness, hallucination, safety, privacy, robustness, and authentication—with 26 sub-tasks and 4,420+ audio samples. It systematically evaluates the trustworthiness boundaries of 14 state-of-the-art open- and closed-source ALLMs in high-stakes audio scenarios.
BiasBusters: Uncovering and Mitigating Tool Selection Bias in Large Language Models: This paper presents the first systematic study of bias in LLM tool selection. When multiple functionally equivalent APIs are available, LLMs systematically favor certain tools due to semantic alignment, positional effects, and pretraining exposure. The authors propose a total variation–based bias metric, a benchmark spanning 10 tool categories, and a lightweight debiasing strategy based on filtering followed by uniform sampling.
Enhancing Hallucination Detection through Noise Injection: This paper injects uniform noise into the MLP activations of intermediate LLM layers to approximate the Bayesian posterior, capturing epistemic uncertainty that complements the aleatoric uncertainty captured by sampling temperature. This raises hallucination detection AUROC on GSM8K from 71.56 to 76.14.
Erase or Hide? Suppressing Spurious Unlearning Neurons for Robust Unlearning: This paper exposes the "shallow alignment" problem in mainstream LLM unlearning methods — rather than truly erasing target knowledge, these methods generate "spurious unlearning neurons" that suppress its expression, allowing the knowledge to be readily recovered via subsequent fine-tuning. The proposed method, Ssiuu, employs attribution-guided regularization to prevent the growth of negative influence, achieving robust unlearning.
Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs: This paper introduces the IRIS Benchmark, the first benchmark to synchronously evaluate fairness in both understanding and generation tasks for Unified Multimodal Large Language Models (UMLLMs). Through a three-dimensional evaluation framework, 60 fine-grained metrics, and a high-dimensional fairness space, IRIS reveals key phenomena such as cross-task "personality splitting" and systematic "generation gaps."
Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions: This paper proposes Concept DAS (CDAS), which achieves faithful bi-directional model steering through a Jensen-Shannon divergence distribution matching objective and distributed interchange interventions (DII). The method enables systematic behavioral control in safety-critical scenarios—bypassing refusal behaviors and eliminating backdoors—while preserving general model capabilities.
From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning: This paper proposes ATAD (Agent-Centric Text Anomaly Detection), which replaces static benchmarks with a Teacher-Orchestrator-Student three-agent competition and validation loop. Using text anomaly detection as the task format, ATAD achieves self-calibrating, dynamically evolving LLM reasoning evaluation — all evaluated LLMs achieve average accuracies of only 54–59% (far below 90%+ on static benchmarks), effectively exposing reasoning weaknesses.
Gaussian Certified Unlearning in High Dimensions: A Hypothesis Testing Approach: This paper proposes $(\phi,\varepsilon)$-Gaussian certifiability — a high-dimensional machine unlearning privacy framework grounded in hypothesis testing trade-off functions. It rigorously proves that, in the high-dimensional proportional regime ($p \sim n$), a single Newton step combined with calibrated Gaussian noise simultaneously satisfies privacy (GPAR) and accuracy (GED→0) requirements. The work refutes the conclusion of Zou et al. (2025) that "at least two Newton steps are necessary," and theoretically identifies the fundamental incompatibility between the classical $\varepsilon$-certifiability and noise-addition mechanisms.
Heterogeneous Federated Fine-Tuning with Parallel One-Rank Adaptation: This paper proposes Fed-PLoRA, a framework that replaces multi-rank LoRA with multiple parallel one-rank modules (PLoRA). Via a Select-N-Fold strategy—selecting $N$ modules for training and folding the remainder into frozen weights—it achieves zero initialization noise and minimal aggregation noise for heterogeneous federated fine-tuning, outperforming existing methods across 6 LLMs and multiple tasks.
Improving the Trade-off Between Watermark Strength and Speculative Sampling Efficiency for Language Models: This paper proposes a quantitative measure of watermark strength (expected KL divergence) and fully characterizes the Pareto trade-off curve between watermark strength and speculative sampling efficiency. By pseudo-randomizing the acceptance decision, the method simultaneously achieves maximum watermark strength and optimal sampling efficiency.
Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates: This paper exposes LLM chat templates (Jinja2) as a novel inference-time backdoor attack surface. Without modifying model weights, poisoning training data, or controlling inference infrastructure, an adversary can implant conditionally triggered backdoors by modifying only the template within a GGUF file. Attacks are validated across 18 models and 4 inference engines with a success rate exceeding 80%, while completely evading HuggingFace's security scanning.
Inoculation Prompting: Eliciting Traits from LLMs during Training Can Suppress Them at Test-Time: This paper proposes Inoculation Prompting—inserting a system prompt describing an undesired trait (e.g., "You are a malicious, evil assistant") into finetuning data, so the model associates that trait with the prompt rather than learning it globally. Removing the prompt at test time causes the trait to nearly vanish, effectively mitigating Emergent Misalignment, backdoor attacks, and subliminal learning.
LH-Deception: Simulating and Understanding LLM Deceptive Behaviors in Long-Horizon Interactions: This paper proposes LH-Deception, the first simulation framework for LLM deceptive behaviors in long-horizon interactions. It adopts a three-role multi-agent architecture comprising a performer, a supervisor, and a deception auditor, combined with a social-science-theory-driven probabilistic event system. Across 11 frontier models, the framework systematically quantifies deception frequency, severity, type distribution, and trust erosion effects, revealing an emergent "chain of deception" phenomenon that static single-turn evaluations are entirely unable to capture.
Lifelong Learning with Behavior Consolidation for Vehicle Routing: This paper proposes LLR-BC, a framework for lifelong learning in neural VRP solvers. By combining decision-step-level experience buffers, Confidence-aware Experience Weighting (CaEW), and Decision-seeking Behavior Consolidation via reverse KL divergence (DsBC), LLR-BC reduces the Average Performance gap (AP) by an order of magnitude on task sequences with simultaneously shifting distributions and scales, while preserving plasticity for new tasks and improving zero-shot generalization.
Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark: This paper proposes EAPrivacy — the first 4-tier benchmark for evaluating LLM physical-world privacy awareness (400+ procedurally generated scenarios, 60+ physical scenes). It finds that all frontier models exhibit "asymmetric conservatism" (over-cautious on task execution yet insufficient on privacy protection), that enabling reasoning/thinking mode actually degrades privacy performance, and that the best model (Gemini 2.5 Pro) achieves only 59% accuracy in dynamic environments.
Membership Inference Attacks Against Fine-tuned Diffusion Language Models (SAMA): This paper presents the first systematic study of membership inference attack (MIA) vulnerabilities in diffusion language models (DLMs), proposing SAMA: a method that exploits DLMs' bidirectional masking structure to generate exponentially many probing opportunities, and handles sparse, heavy-tailed membership signals via progressive masking, sign voting, and adaptive weighting. SAMA achieves an AUC of 0.81 across 9 datasets, outperforming the best baseline by 30%.
OFMU: Optimization-Driven Framework for Machine Unlearning: This work formulates machine unlearning as a bilevel optimization problem: the inner level maximizes the forgetting loss with gradient decorrelation to prevent damage to the retain set, while the outer level minimizes the retain loss with a penalty term enforcing stationary points of the inner objective. On the TOFU benchmark, OFMU simultaneously achieves high forgetting quality and high model utility, outperforming GA/GradDiff/NPO/RMU in terms of forget-retain balance.
Perturbation-Induced Linearization: Constructing Unlearnable Data with Solely Linear Classifiers: This paper proposes PIL, a method that generates unlearnable perturbations using only a bias-free linear classifier as the surrogate model. By inducing linearization in deep models, PIL prevents them from learning semantic features, achieving over 100× speedup compared to existing methods (under 1 minute of GPU time on CIFAR-10).
PMark: Towards Robust and Distortion-free Semantic-level Watermarking with Channel Constraints: PMark is a theoretically distortion-free and paraphrase-robust semantic-level watermarking (SWM) method for LLMs. It employs cascaded binary filtering over candidate sentences using multiple orthogonal pivot vectors, with median-based sampling to guarantee distortion-freeness. A multi-channel design increases watermark evidence density and enhances robustness. Under paraphrase attacks, TP@FP1% reaches 95%+, outperforming prior SWM methods by 14.8%.
Purifying Generative LLMs from Backdoors without Prior Knowledge or Clean Reference: This paper proposes a backdoor purification method for LLMs that requires neither prior knowledge nor a clean reference model. Mechanistic analysis reveals that backdoor associations are redundantly distributed across MLP layers. Inspired by immunology, the method extracts a "signature" from multiple backdoor variants, localizes and suppresses suspicious neurons, and applies lightweight fine-tuning for recovery. Across 5 attacks × 3 tasks, ASR is reduced by 80%+ while utility is preserved.
Redirection for Erasing Memory (REM): Towards a Universal Unlearning Method for Corrupted Data: This paper proposes a two-dimensional taxonomy for the corrupted data unlearning task (discovery rate × statistical regularity), reveals that existing unlearning methods are each effective only within specific regions of this space, and introduces REM (Redirection for Erasing Memory), which redirects corrupted data into newly added dedicated network capacity before discarding it—achieving strong and consistent unlearning performance across the entire two-dimensional task space for the first time.
RedSage: A Cybersecurity Generalist LLM: This paper introduces RedSage—the first fully open-source cybersecurity generalist LLM—built upon large-scale domain continual pre-training on 11.7B tokens, agentic-augmentation SFT with 266K samples, and RedSage-Bench, the first comprehensive evaluation benchmark covering knowledge, skills, and tools. The resulting 8B-parameter model surpasses same-scale SOTA on cybersecurity benchmarks by +5.4 pp and approaches Qwen3-32B, while simultaneously improving general-purpose performance (+8.4 pp vs. Qwen3-8B).
Resource-Adaptive Federated Text Generation with Differential Privacy: This paper proposes a resource-adaptive federated text generation framework that employs a two-stage design — DP fine-tuning on strong clients and DP voting on weak clients — to generate high-quality synthetic text data under computational heterogeneity and differential privacy constraints.
SABRE-FL: Selective and Accurate Backdoor Rejection for Federated Prompt Learning: This paper is the first to investigate backdoor attack threats in the federated prompt learning (FPL) setting, and proposes SABRE-FL — a lightweight server-side defense based on anomaly detection in the embedding space — which effectively filters poisoned prompt updates without accessing clients' raw data.
SecP-Tuning: Efficient Privacy-Preserving Prompt Tuning for Large Language Models via MPC: This paper proposes SecP-Tuning, the first privacy-preserving prompt tuning framework based on secure multi-party computation (MPC). It eliminates backpropagation overhead via forward-only tuning and reduces communication complexity by replacing softmax with privacy-preserving random feature attention (RFA), achieving approximately 12–16× speedup and 17–20× reduction in communication volume.
Self-Destructive Language Model: This paper proposes Seam, which couples the optimization trajectories of benign and harmful data (forcing their gradients into opposite directions) to transform an LLM into a "self-destructive model." Harmful fine-tuning automatically triggers catastrophic performance collapse, creating an inescapable dilemma for attackers: low-intensity attacks are ineffective, while high-intensity attacks render the model unusable.
SHE-LoRA: Selective Homomorphic Encryption for Federated Tuning with Heterogeneous LoRA: This paper proposes SHE-LoRA, which integrates Selective Homomorphic Encryption (SHE) with LoRA for cross-device federated LLM fine-tuning. The framework features sensitivity-based column-level encrypted subset negotiation, column-swap parameter obfuscation, and column-aware adaptive aggregation. It achieves model performance comparable to non-private baselines while reducing communication overhead by 99.71% and encryption time by 99.87%, providing complete resistance against the state-of-the-art gradient inversion attack DAGER.
SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense: This work is the first to systematically trace object hallucinations in LVLMs back to the visual encoder, identifying three core issues: statistical bias (over-emphasis on high-frequency pattern tokens), inherent bias (residual representations of pre-training dominant objects), and vulnerability (feature distortion under minimal perturbations). It proposes SHIELD—a fully training-free framework that jointly addresses these issues via token reweighting, token subtraction, and contrastive decoding, achieving comprehensive improvements over VCD and OPERA on LLaVA-1.5, InstructBLIP, and Qwen-VL.
Train Once, Answer All: Many Pretraining Experiments for the Cost of One: This paper proposes a methodological framework for running multiple independent experiments simultaneously within a single LLM pretraining run. Training a 2.7B-parameter model on 210B tokens, the framework concurrently executes 10 experiments, successfully replicates the results of 5 prior works, and conducts 3 novel experiments. It further introduces Continual Pretraining Dependence Testing (CPDT) to verify inter-experiment independence.
Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks (DialTree): This paper proposes DialTree, which frames multi-turn red-teaming as a goal-oriented dialogue policy optimization problem. By employing tree-structured rollouts with quality-based pruning to explore the attack trajectory space, combined with an adaptive mask to prevent format forgetting, DialTree achieves an average ASR of 81.5% across 12 target models—44.2% higher than the previous SOTA—and attains 71% ASR even on Claude-4-Sonnet.
Understanding Sensitivity of Differential Attention through the Lens of Adversarial Robustness: This work is the first to analyze the Differential Attention (DA) mechanism from an adversarial robustness perspective. It reveals that the subtraction structure in DA, while suppressing noise, amplifies sensitivity to adversarial perturbations through negative gradient alignment. The study establishes a "Fragility Principle"—DA improves discriminability on clean samples but becomes more vulnerable under adversarial attacks—and identifies a depth-dependent robustness crossover effect.
Understanding Sensitivity of Differential Attention through the Lens of Adversarial Robustness: This work provides the first adversarial robustness analysis of the structural vulnerability in Differential Attention (DA): while the subtraction mechanism suppresses noise, it amplifies sensitivity to adversarial perturbations due to negative gradient alignment, revealing a fundamental trade-off between selectivity and robustness.
Unlearning Evaluation through Subset Statistical Independence: This paper proposes Split-half Dependence Evaluation (SDE), which leverages HSIC-based statistical independence testing to evaluate machine unlearning at the subset level, requiring neither model retraining nor auxiliary classifiers.
Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models: This paper proposes X-GRAAD, an inference-time backdoor defense that combines attention anomaly scoring and gradient importance scoring to localize trigger tokens, followed by character-level perturbation to neutralize them. Across 5 Transformer models × 3 attack types, ASR is reduced to near 0% while maintaining 88–95%+ CACC, with a 30× speedup over PURE.
Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning: This paper proposes Veritas, an MLLM-based deepfake detector that simulates human authentication reasoning via pattern-aware reasoning (fast judgment → reasoning → planning → self-reflection → conclusion). It introduces a two-stage training pipeline (SFT+MiPO cold-start + P-GRPO reinforcement learning) and constructs the HydraFake benchmark with a four-level OOD evaluation protocol. Veritas achieves an average accuracy of 90.7% across cross-forgery and cross-domain scenarios, surpassing the previous SOTA by 6.0%.
VeriTrail: Closed-Domain Hallucination Detection with Traceability: This paper proposes VeriTrail, the first closed-domain hallucination detection method designed for multi-generative-step (MGS) pipelines. By modeling the generation process as a DAG and verifying claims layer by layer along the graph, VeriTrail achieves full traceability encompassing hallucination detection, provenance tracking, and error localization. It substantially outperforms all baselines on two newly introduced datasets.
VeriTrail: Closed-Domain Hallucination Detection with Traceability: This paper proposes VeriTrail — the first closed-domain hallucination detection method that provides traceability for multi-generative-step (MGS) processes. It models the generation process as a DAG and performs layer-by-layer verification along paths, while also introducing the first MGS datasets that include all intermediate outputs with human annotations.
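
As a rough illustration of the BiasBusters entry above, the sketch below computes a total variation distance between an empirical tool-selection distribution and the uniform distribution, then applies a filter-then-uniform-sample step. The exact metric and filter used in the paper are not given in this note, so the formula, the function names `selection_bias_tv` and `debiased_pick`, and the `is_relevant` predicate are all assumptions made for illustration.

```python
import random
from collections import Counter


def selection_bias_tv(selections, tools):
    """Total variation distance between observed tool picks and uniform.

    Assumption: bias is measured against a uniform reference over
    functionally equivalent tools. Returns 0.0 for perfectly even
    selection and approaches 1 - 1/len(tools) when one tool is always chosen.
    """
    counts = Counter(selections)
    n = len(selections)
    uniform = 1.0 / len(tools)
    return 0.5 * sum(abs(counts.get(t, 0) / n - uniform) for t in tools)


def debiased_pick(candidates, is_relevant, rng=random):
    """Filter candidates with a (hypothetical) relevance check, then sample uniformly."""
    eligible = [t for t in candidates if is_relevant(t)]
    if not eligible:
        raise ValueError("no candidate passed the filter")
    return rng.choice(eligible)


if __name__ == "__main__":
    tools = ["weather_a", "weather_b"]
    print(selection_bias_tv(["weather_a"] * 4, tools))  # always picks one tool
    print(selection_bias_tv(["weather_a", "weather_b"] * 2, tools))  # even split
    print(debiased_pick(tools + ["calculator"], lambda t: t.startswith("weather")))
```

Sampling uniformly after filtering removes positional and name-based preferences by construction, at the cost of ignoring any real quality differences among the tools that pass the filter.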

📈 Time Series¶

Adapt Data to Model: Adaptive Transformation Optimization for Domain-shared Time Series Foundation Models: This paper proposes TATO, a framework that automatically optimizes data preprocessing pipelines (including context trimming, scale normalization, and outlier correction) to adapt frozen large time-series models (LTMs) to diverse downstream domains without fine-tuning, achieving an average MSE reduction of 13.6% and up to 65.4%.
Brain-Semantoks: Learning Semantic Tokens of Brain Dynamics with a Self-Distilled Foundation Model: This paper proposes Brain-Semantoks, an fMRI foundation model based on a semantic tokenizer and a self-distillation objective. It aggregates functional network signals into robust semantic tokens and learns abstract brain dynamic representations through cross-temporal view consistency, achieving state-of-the-art performance under a linear probing setting.
Contextual and Seasonal LSTMs for Time Series Anomaly Detection: To address "small-magnitude point anomalies" and "slowly rising anomalies" that existing methods struggle to detect in univariate time series, this paper proposes CS-LSTMs, a dual-branch architecture in which S-LSTM models periodic evolution in the frequency domain and C-LSTM captures local trends in the time domain. Combined with a wavelet-based noise decomposition strategy, the method comprehensively outperforms state-of-the-art approaches on four benchmarks while improving inference speed by 40%.
CPiRi: Channel Permutation-Invariant Relational Interaction for Multivariate Time Series Forecasting: This paper proposes the CPiRi framework, which achieves channel permutation-invariant (CPI) cross-channel relational modeling via a frozen pretrained temporal encoder, a lightweight spatial Transformer, and a channel-shuffling training strategy. CPiRi attains state-of-the-art performance on 5 benchmarks with negligible degradation under channel permutation ($\Delta$WAPE < 0.25%).
CPiRi: Channel Permutation-Invariant Relational Interaction for Multivariate Time Series Forecasting: This paper proposes the CPiRi framework, which achieves channel permutation invariance (CPI) without sacrificing cross-channel modeling capability by combining a frozen pretrained temporal encoder, a trainable permutation-equivariant spatial module, and a channel shuffling training strategy. CPiRi achieves state-of-the-art performance on multiple traffic benchmarks.
Delta-XAI: A Unified Framework for Explaining Prediction Changes in Online Time Series Monitoring: This paper proposes Delta-XAI, a unified framework that adapts 14 existing XAI methods to the scenario of explaining prediction changes in online time series monitoring via a wrapper function. It further introduces SWING (Shifted Window Integrated Gradients), which constructs integration paths using past observations to capture temporal dependencies, consistently outperforming existing methods across multiple evaluation metrics.
Dissecting Chronos: Sparse Autoencoders Reveal Causal Feature Hierarchies in Time Series Foundation Models: This work is the first to apply Sparse Autoencoders (SAEs) to a time series foundation model (Chronos-T5-Large), revealing a depth-dependent feature hierarchy through 392 causal ablation experiments: mid-layer encoders concentrate causally critical change-point detection features, whereas the semantically richest final encoder layer exhibits the lowest causal importance.
EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements: This paper constructs EDINET-Bench, a financial benchmark derived from ten years of Japanese EDINET annual reports, comprising three expert-level tasks—accounting fraud detection, earnings forecasting, and industry classification—and finds that even state-of-the-art LLMs only marginally outperform logistic regression.
Enhancing Multivariate Time Series Forecasting with Global Temporal Retrieval: This paper proposes the Global Temporal Retriever (GTR), a lightweight plug-and-play module that maintains adaptive global period embeddings and leverages absolute time indices to retrieve temporally aligned global periodic information, enabling arbitrary forecasting models to transcend the look-back window constraint and effectively capture global periodic patterns far exceeding the input length.
FeDaL: Federated Dataset Learning for General Time Series Foundation Models: This paper proposes FeDaL, a federated framework that trains a general time series foundation model from scratch via client-side Domain Bias Elimination (DBE) and server-side Global Bias Elimination (GBE), achieving competitive or superior performance across 8 downstream task types with significantly fewer parameters than centralized TSFMs.
Free Energy Mixer: This paper proposes Free Energy Mixer (FEM), which reframes attention value retrieval as a free energy (log-sum-exp) optimization problem, enabling value-aware posterior selection at the per-channel level. FEM addresses the inherent bottleneck of standard attention—lossless storage but lossy reading—and serves as a plug-and-play replacement for softmax/linear attention/RNN/SSM, yielding consistent improvements across NLP, vision, and time series tasks.
From Samples to Scenarios: A New Paradigm for Probabilistic Forecasting: This paper proposes the Probabilistic Scenarios paradigm, in which a model directly outputs a finite set of {scenario, probability} pairs in place of sampling, and introduces TimePrism — a model consisting of only three parallel linear layers — that achieves 9/10 SOTA results across 5 benchmark datasets.
GTM: A General Time-series Model for Enhanced Representation Learning of Time-Series Data: This paper proposes GTM, a general time-series foundation model that captures temporal granularity-aware features via a frequency-domain attention mechanism. Combined with a hybrid masking pre-training strategy, GTM is the first model to support all generative time-series tasks without any task-specific architectural modifications.
GTM: A General Time-series Model for Enhanced Representation Learning: GTM is a general time-series foundation model that captures temporal granularity-aware features via a Fourier attention mechanism and unifies reconstruction and autoregressive pre-training objectives through hybrid masking, achieving state-of-the-art performance across forecasting, imputation, anomaly detection, and classification tasks.
HiVid: LLM-Guided Video Saliency For Content-Aware VOD And Live Streaming: This paper proposes HiVid, the first framework to leverage LLMs as human proxies for generating content importance weights for video chunks. Through a Perception module (sliding-window scoring), a Ranking module (LLM-guided merge sort to eliminate scoring bias), and a Prediction module (multimodal time series forecasting with adaptive latency), HiVid enables content-aware streaming, achieving an 11.5% improvement in VOD PLCC, a 26% gain in live streaming prediction, and a 14.7% improvement in human MOS correlation.
Language in the Flow of Time: Time-Series-Paired Texts Weaved into a Unified Temporal Narrative: This paper identifies that time-series-paired texts exhibit periodicity analogous to that of time series (Chronological Textual Resonance), and proposes the TaTS framework, which transforms text representations into auxiliary variables to enhance the forecasting and imputation performance of arbitrary existing time series models in a plug-and-play manner.
Learning Recursive Multi-Scale Representations for Irregular Multivariate Time Series Forecasting: This paper proposes ReIMTS, a plug-and-play framework that preserves the original sampling patterns of irregular multivariate time series (IMTS) via time-period-based recursive partitioning (rather than resampling), combined with an irregularity-aware representation fusion mechanism for multi-scale modeling. ReIMTS achieves an average improvement of 27.1% across six IMTS backbones.
PAANO: Patch-Based Representation Learning for Time-Series Anomaly Detection: This paper proposes PaAno, a lightweight patch-level representation learning method for time-series anomaly detection. It employs a 1D-CNN encoder trained with triplet loss and pretext loss to learn a patch embedding space, and computes anomaly scores by measuring the distance between query patches and normal patches stored in a memory bank. PaAno achieves comprehensive state-of-the-art performance on the TSB-AD benchmark while requiring only 0.3M parameters and seconds of inference time.
Rating Quality of Diverse Time Series Data by Meta-learning from LLM Judgment: This paper proposes TSRating, a framework that leverages LLMs to perform pairwise quality comparisons of time series (TS) data segments across four dimensions—trend, frequency, amplitude, and pattern. Pairwise judgments are converted to scalar quality scores via the Bradley-Terry model. A TSRater model (MOMENT encoder + MLP) is then trained using MAML meta-learning across 9 domains and 22 subsets, enabling efficient and unified cross-domain TS data quality assessment.
Reasoning on Time-Series for Financial Technical Analysis: This paper proposes the Verbal Technical Analysis (VTA) framework, which combines the linguistic reasoning capabilities of LLMs with the pattern-capturing capacity of time-series models. Time-GRPO reinforcement learning is employed to optimize reasoning chains, and inferred attributes are used to condition time-series forecasting, achieving financial time-series prediction that is both accurate and interpretable.
Relational Feature Caching for Accelerating Diffusion Transformers: This paper proposes Relational Feature Caching (RFC), a framework that enhances the accuracy of cached feature prediction by exploiting the strong correlation between input and output features of DiT modules. RFC comprises two components: RFE, which estimates output change magnitude from input variations, and RCS, which uses input prediction error as a proxy to determine when full computation is required. RFC significantly outperforms existing temporal extrapolation-based caching methods on both image and video generation tasks.
Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data: This paper proposes the Relational Transformer (RT) architecture, which leverages task table prompting, cell tokenization, and a Relational Attention mechanism to enable zero-shot transfer to unseen datasets and tasks after pretraining on multiple relational databases. The 22M-parameter model achieves 93% of fully supervised AUROC in the zero-shot setting, significantly outperforming a 27B LLM at 84%.
ResCP: Reservoir Conformal Prediction for Time Series Forecasting: This work is the first to integrate Reservoir Computing (Echo State Network) into conformal prediction. By using randomly initialized ESNs to encode the temporal dynamics of residual sequences, the method leverages state similarity to adaptively reweight historical residuals for constructing local prediction intervals—requiring no training—and achieves state-of-the-art Winkler scores on 4 real-world datasets while running 20–80× faster than HopCPT.
Routing Channel-Patch Dependencies in Time Series Forecasting with Graph Spectral Decomposition: This paper proposes xCPD, a plug-and-play module that refines the modeling unit of multivariate time series from "channels" to "channel-patches." It constructs spectral embeddings via a shared graph Fourier basis, groups nodes into low/mid/high frequency bands based on spectral energy responses, and applies dynamic MoE routing to adaptively select frequency-specific filter experts. xCPD can be integrated into any existing channel-independent (CI) or channel-dependent (CD) model to consistently improve both long- and short-term forecasting performance, and it supports zero-shot transfer.
SciTS: Scientific Time Series Understanding and Generation with LLMs: This paper proposes SciTS—a scientific time series benchmark spanning 12 scientific domains, 43 tasks, and 54K+ samples—and introduces the TimeOmni framework, which unifies understanding and generation tasks via multi-patch expert routing and an LLM backbone, achieving the best overall performance across the full benchmark.
SciTS: Scientific Time Series Understanding and Generation with LLMs: This work proposes the SciTS benchmark, covering 43 tasks across 12 scientific domains with 54K+ instances (lengths from $10^0$ to $10^7$, frequencies up to 10 MHz). It systematically evaluates 17 models and finds that general-purpose LLMs generalize better than specialized time-series models, while text and image encodings each have distinct limitations. Accordingly, it designs the TimeOmni framework, which employs multi-patch experts with a routing mechanism and patch reprogramming to explicitly model temporal dynamics in joint training with an LLM backbone.
SwiftTS: A Swift Selection Framework for Time Series Pre-trained Models via Multi-task Meta-Learning: This paper proposes SwiftTS, the first model selection framework for time series pre-trained models. It employs a dual-encoder architecture to independently embed patch-level temporal features of datasets and model meta-information (architecture / topology / function), computes compatibility scores via patch-level cross-attention, and incorporates horizon-adaptive mixture-of-experts together with cross-domain/cross-horizon meta-learning. On 14 datasets × 8 models, it achieves an average weighted Kendall $\tau_\omega = 0.442$, substantially outperforming all baselines.
T1: One-to-One Channel-Head Binding for Multivariate Time-Series Imputation: This paper proposes T1, a CNN-Transformer hybrid architecture whose core innovation is Channel-Head Binding (CHead Attention): a shared depthwise convolution extracts $C$ types of temporal features (trend, periodicity, abrupt changes, etc.) for each variable, and each CNN channel is then bound one-to-one to a single attention head, enabling cross-variable information transfer to proceed independently at the feature level. When missing data prevents a channel from extracting a valid pattern, the corresponding attention head automatically down-weights, achieving adaptive missing-data handling without explicit design. On 11 benchmark datasets, the average MSE is reduced by 46%, with even larger gains under 70% extreme missingness.
Tensor learning with orthogonal, Lorentz, and symplectic symmetries: This paper provides a complete parameterization of equivariant polynomial functions under the diagonal action of the orthogonal group $O(d)$, the indefinite orthogonal group (including the Lorentz group), and the symplectic group $Sp(d)$ on tensors. The framework is applied to design learnable sparse vector recovery algorithms that outperform existing sum-of-squares spectral methods across multiple data-generating assumptions.
Test-Time Efficient Pretrained Model Portfolios for Time Series Forecasting: This paper proposes Chroma — a portfolio framework of small pretrained time series models: frequency/domain expert models are derived from a general model via post-training (achieving 10× training speedup), and at test time predictions are combined through model selection or greedy ensemble. A 4M-parameter portfolio matches the performance of 205M–500M parameter monolithic models on Chronos Benchmark II, while requiring far less inference computation than test-time fine-tuning.
TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models: TimeOmni-1 proposes the first unified time series reasoning model, leveraging TSR-Suite (the first reasoning-oriented time series dataset suite) and a two-stage training paradigm (SFT for injecting temporal priors + RL for refining reasoning), achieving significant improvements over GPT-4.1 across multiple time series reasoning tasks.
TimeSliver: Symbolic-Linear Decomposition for Explainable Time Series Classification: TimeSliver is an explainability-driven deep learning framework that jointly leverages raw time series data and symbolic abstractions (binning) to construct representations that preserve the original temporal structure. Each element linearly encodes the contribution of its corresponding temporal segment to the final prediction, yielding per-timestep positive/negative attribution scores. TimeSliver surpasses competing methods by 11% in temporal attribution accuracy across 7 datasets while achieving performance on par with SOTA on 26 UEA benchmarks.
Towards Generalizable PDE Dynamics Forecasting via Physics-Guided Invariant Learning: This paper proposes iMOOE, a framework that explicitly formalizes two-level physical invariance principles — operator invariance and compositional invariance — within PDE systems, and instantiates them via a mixture-of-operator-experts network and a frequency-enhanced risk equalization objective, achieving state-of-the-art zero-shot PDE dynamics forecasting across diverse OOD scenarios without any test-time adaptation.
Towards Robust Real-World Multivariate Time Series Forecasting: A Unified Framework: This paper proposes ChannelTokenFormer (CTF), a unified Transformer framework that simultaneously addresses three core challenges in real-world multivariate time series forecasting: (1) complex inter-channel dependencies — via channel token cross-channel attention; (2) asynchronous sampling across channels — via frequency-domain dynamic patching that preserves original resolution; (3) block-wise missingness at test time — via patch masking during training and direct removal of fully-missing patches at inference. CTF achieves comprehensive state-of-the-art results across six datasets including ETT, SolarWind, Weather, EPA, and CHS.
TSPulse: Tiny Pre-Trained Models with Disentangled Representations for Rapid Time Series: This paper proposes TSPulse, an ultra-lightweight time series pre-trained model with only 1M parameters, which surpasses models 10–100× larger on four tasks — classification (+5–16%), anomaly detection (+20%), imputation (+50%), and similarity retrieval (+25%) — through dual-space masked reconstruction and dual-embedding disentanglement.
TSRating: Rating Quality of Diverse Time Series Data by Meta-learning from LLM Judgment: TSRating leverages the prior knowledge of LLMs to conduct pairwise quality judgments of time series data chunks across four dimensions—trend, frequency, amplitude, and pattern—converts these comparisons into scalar scores via the Bradley-Terry model, and trains a cross-domain generalizable TSRater via meta-learning, enabling efficient and accurate time series data quality assessment.
Tuning the Burn-in Phase in RNN Training Improves Performance: This paper provides a theoretical analysis of the critical role played by the burn-in length $m$ in Truncated Backpropagation Through Time (TBPTT) training of RNNs. It establishes upper bounds on training regret and validates through system identification and time series forecasting experiments that appropriately tuning the burn-in phase can reduce prediction error by more than 60%.
VoT: Event-Driven Reasoning and Multi-Level Alignment Unlock the Value of Text for Time Series Forecasting: This paper proposes VoT, a multimodal time series forecasting method that fully exploits the value of textual information through event-driven reasoning (leveraging LLMs to perform structured reasoning over exogenous text for numerical prediction) and multi-level alignment (representation-level endogenous text alignment + prediction-level adaptive frequency fusion). VoT comprehensively outperforms existing methods on real-world datasets spanning 10 domains.
WARP: Weight-Space Linear Recurrent Neural Networks: This paper proposes WARP (Weight-space Adaptive Recurrent Prediction), which explicitly parameterizes the hidden state of a linear RNN as the weights and biases of an auxiliary MLP. Input differences drive a linear recurrence to update these weights, and a nonlinear decoding step enables efficient sequence modeling. WARP achieves state-of-the-art performance on classification, forecasting, and dynamical system reconstruction tasks.
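
The TSRating note above mentions converting pairwise quality judgments into scalar scores via the Bradley-Terry model. A minimal sketch of that conversion step follows, assuming the judgments arrive as `(winner, loser)` index pairs. The function name `bradley_terry`, the iteration count, and the toy comparisons are illustrative assumptions, not details from the paper; the fitting loop is the standard minorization-maximization update for Bradley-Terry strengths.

```python
def bradley_terry(n_items, comparisons, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs.

    Uses the classic minorization-maximization update
        p_i <- wins_i / sum_j n_ij / (p_i + p_j)
    and rescales so the strengths sum to n_items.
    """
    wins = [0.0] * n_items
    pair_counts = {}  # unordered pair (a, b) -> number of comparisons
    for winner, loser in comparisons:
        wins[winner] += 1.0
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            denom = 0.0
            for (a, b), n in pair_counts.items():
                if i == a or i == b:
                    j = b if i == a else a
                    denom += n / (p[i] + p[j])
            # Items that were never compared keep their current strength.
            new_p.append(wins[i] / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x * n_items / total for x in new_p]
    return p


# Hypothetical judgments: chunk 0 preferred over chunk 1 in 3 of 4 comparisons.
scores = bradley_terry(2, [(0, 1), (0, 1), (0, 1), (1, 0)])
print(scores)
```

With these toy judgments the fitted strengths have a 3:1 ratio, matching the 3-to-1 win record. In a pipeline like the one described, such scores would presumably serve as training targets for a rater model, but that step is outside this sketch.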

🔍 Information Retrieval & RAG

AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations: This paper proposes AMemGym — the first long-horizon conversational memory benchmark environment supporting on-policy interactive evaluation. It drives LLM-simulated users via structured data sampling (user profile → state evolution → personalized QA), reveals ranking biases inherent in off-policy evaluation, and systematically diagnoses write/read/utilization failure modes across RAG, long-context, and agent-based memory systems.
Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation: This paper proposes ARC-JSD, a method that computes the Jensen-Shannon Divergence (JSD) between response distributions under full context and sentence-ablated context, enabling efficient and accurate RAG context attribution without fine-tuning, gradient computation, or surrogate models. Combined with Logit Lens for mechanistic analysis, ARC-JSD identifies the attention heads and MLP layers responsible for context attribution, and reduces hallucination rates by approximately 39% via a gating mechanism.
Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation: This paper reformulates positional encoding as prior distributions within a Bayesian attention mechanism, unifying NoPE (uniform prior) and ALiBi (Laplacian prior), and proposes a Generalized Gaussian prior (GGD-BAM) that achieves perfect passkey retrieval at 500× the training length by adding only 384 parameters.
Beyond RAG vs. Long-Context: Learning Distraction-Aware Retrieval for Efficient Knowledge Grounding: This paper proposes LDAR (Learning Distraction-Aware Retrieval), a lightweight adaptive retriever that learns to select passages by sampling a continuous quantile band from the query-passage similarity distribution. LDAR surpasses long-context methods while using approximately half the token budget, balancing information coverage against the influence of distracting passages.
BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs: This paper proposes BTZSC, a benchmark comprising 22 datasets, which for the first time systematically compares four model families — NLI cross-encoders, embedding models, rerankers, and instruction-tuned LLMs (38 models in total) — under a unified zero-shot protocol. Qwen3-Reranker-8B achieves a new SOTA with macro F1 = 0.72, while embedding models demonstrate the best accuracy–latency trade-off.
Digging Deeper: Learning Multi-Level Concept Hierarchies: This paper proposes Multi-Level Concept Splitting (MLCS), which extends concept splitting from a single layer to a recursive multi-level process. Using only top-level concept annotations, MLCS automatically discovers concept hierarchy trees of arbitrary depth. The authors further introduce the Deep-HiCEMs architecture to represent and leverage these deep hierarchies, enabling test-time concept interventions at multiple levels of granularity.
Efficient Discriminative Joint Encoders for Large Scale Vision-Language Re-ranking: This paper proposes EDJE (Efficient Discriminative Joint Encoder), which moves visual feature extraction offline and compresses visual tokens via a lightweight attention adapter, achieving a throughput of 50k image-text pairs per second. EDJE matches the retrieval performance of existing joint encoders on Flickr (zero-shot) and COCO (fine-tuned) while requiring only 49 kB of storage per image.
Embedding-Based Context-Aware Reranker: This paper proposes EBCAR, a lightweight embedding-space reranking framework that injects structural information via document ID embeddings and passage positional encodings. It employs a hybrid mechanism combining shared full attention and dedicated masked attention to enable cross-passage reasoning. EBCAR achieves state-of-the-art average nDCG@10 on the ConTEB benchmark with only 126M parameters, while delivering inference throughput more than 150× faster than LLM-based rerankers.
Fine-tuning with RAG for Improving LLM Learning of New Skills: This paper proposes transforming RAG from a permanent inference-time dependency into a training-time teacher signal. Hints are extracted from agent failures, used to augment a teacher model that generates higher-quality trajectories, and then removed during distillation into a student model. The student thereby internalizes the retrieval-augmented behavior without requiring runtime RAG, achieving a 91% success rate on ALFWorld (baseline: 79%) and a score of 72 on WebShop (baseline: 61).
Flow of Spans: Generalizing Language Models to Dynamic Span-Vocabulary via GFlowNets: This paper proposes FoSS, the first framework to incorporate GFlowNets into span-level language modeling. By constructing a DAG-structured state space in place of the conventional token-by-token tree structure, FoSS enables more flexible and diverse text generation, achieving up to a 12.5% improvement in MAUVE score.
FutureMind: Equipping Small Language Models with Strategic Thinking-Pattern Priors via Adaptive Knowledge Distillation: This paper proposes FutureMind, a training-free framework that distills structured reasoning and retrieval strategies from LLMs into reusable thinking-pattern priors. Through a four-stage pipeline (question analysis → logical reasoning → strategy planning → retrieval guidance) and three retrieval paradigms, FutureMind enables SLMs to achieve state-of-the-art performance on multi-hop QA benchmarks.
G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge: This paper proposes G-reasoner, which standardizes heterogeneous knowledge sources via a four-layer unified graph interface called QuadGraph and trains a 34M-parameter GNN-based graph foundation model to jointly reason over graph topology and textual semantics. In conjunction with an LLM, it achieves state-of-the-art performance over existing GraphRAG methods across 6 benchmarks.
Hierarchical Concept-based Interpretable Models: HiCEMs introduces a hierarchical concept embedding model that automatically discovers fine-grained sub-concepts within the embedding space of a pretrained CEM via Concept Splitting—without requiring additional annotations—thereby constructing a hierarchical concept structure that supports test-time concept interventions at multiple granularities to improve task performance.
HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks: This paper proposes HUME, a human evaluation framework that systematically measures human performance on 16 MTEB datasets spanning reranking, classification, clustering, and STS tasks. Humans rank 4th overall (77.6 vs. the best model score of 80.1). The study reveals that cases where models surpass human performance tend to occur on tasks with the lowest human agreement, and evaluates 9 LLMs as potential annotation proxies.
Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning: This paper proposes HybridDeepSearcher, which constructs the HDS-QA dataset to train a large language reasoning model (LRM) to distinguish parallelizable from sequentially dependent search queries. The approach achieves F1 gains of +15.9 on FanOutQA and +11.5 on a BrowseComp subset, while substantially reducing inference latency and demonstrating consistent test-time search scaling.
Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement: This paper proposes the Judge's Verdict Benchmark—a two-stage evaluation framework based on relevance filtering followed by a Cohen's Kappa human-similarity test—to systematically assess 54 LLM judges. The framework identifies 27 Tier 1 judges (23 human-like and 4 super-consistent). The central finding is that high correlation does not imply high agreement; Kappa combined with z-score is necessary to properly measure LLM judge quality.
Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction: This paper proposes MA-PaPSP, a training-free plug-and-play selective prediction framework for arbitrary VLMs. It constructs proxy embeddings via k-NN weighted averaging over an external retrieval dataset (reducing representational variance) and applies contrastive normalization scoring (improving calibration). MA-PaPSP consistently outperforms PaPSP and LLM-as-judge baselines on selective prediction across image captioning, image-text matching, and classification tasks.
LightRetriever: A LLM-based Text Retrieval Architecture with Extremely Faster Query Inference: This paper proposes LightRetriever, an extremely asymmetric LLM-based retrieval architecture: the document side retains a full LLM encoder, while the query side eliminates deep modeling entirely, so that dense retrieval reduces to embedding lookup and averaging and sparse retrieval reduces to token counting. It achieves a 1000× query-encoding speedup and a 10× end-to-end throughput improvement while retaining 95% of retrieval performance.
Mapping Semantic & Syntactic Relationships with Geometric Rotation: This paper proposes RISE (Rotor-Invariant Shift Estimation), a method that leverages Clifford algebra rotors to represent utterance-level semantic–syntactic transformations (negation, conditionalization, and politeness) as consistent rotation operations on the unit hypersphere. Through systematic experiments across 7 languages × 3 embedding models × 3 transformation types, the paper demonstrates that these rotations transfer across languages and model architectures (77%–95% retention rate), extending the Linear Representation Hypothesis (LRH) from word-level to cross-lingual utterance-level and generalizing it to geodesic structures on curved manifolds.
Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis: This paper proposes PDS (Prototype-Guided Data Synthesis), the first training-free multimodal dataset distillation framework. It leverages CLIP's aligned embedding space to perform modality-specific clustering, applies the Hungarian algorithm for cross-modal prototype matching, and employs an unCLIP decoder to synthesize distilled images from image prototypes. On a distillation set of as few as 100 pairs, PDS surpasses all optimization-based methods at zero training cost while achieving state-of-the-art cross-architecture generalization.
On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation: This paper proposes HOMER, a framework that constructs a three-role LLM collaboration mechanism (conflicting script extractor + hierarchical imaginator + caption generator) grounded in the GTVH theory of verbal humor. By explicitly modeling script opposition, multi-perspective associative chains, and joke database retrieval to build an imagination tree for creative space expansion, HOMER achieves an average improvement of ~7% over baselines on the New Yorker cartoon benchmark using GPT-4o as the backbone, and significantly outperforms all baselines in human evaluation.
Q-RAG: Long Context Multi-Step Retrieval via Value-Based Embedder Training: Q-RAG formulates multi-step retrieval as an MDP and solves it via value-based RL (soft Q-learning), fine-tuning the embedder rather than the LLM. The Q-function is designed as the inner product of state and action embeddings (proven to be a universal approximator) and combined with RoPE relative positional encoding to enable temporal reasoning. Training requires only a single A100 GPU for 12 hours; models trained on 4K-token contexts generalize to 1M+ token contexts, achieving near-perfect NIAH performance on the RULER benchmark.
Query-Level Uncertainty in Large Language Models: This paper introduces the concept of Query-Level Uncertainty and proposes an Internal Confidence method that estimates, prior to generation (via a single forward pass), whether an LLM is capable of answering a given query. The approach is training-free and enables efficient adaptive inference strategies including RAG triggering, model cascading, and abstention.
RAEE: A Robust Retrieval-Augmented Early Exit Framework for Efficient Inference: This paper proposes RAEE, a retrieval-augmented early exit framework that requires no classifier training. By retrieving exit information from semantically similar samples, RAEE dynamically determines the optimal exit layer, simultaneously accelerating inference and correcting model mispredictions — achieving a dual gain in both efficiency and performance.
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding: This paper introduces RAVENEA, the first benchmark for evaluating multimodal retrieval-augmented cultural understanding. It comprises 1,868 instances and 11,396 human-ranked Wikipedia documents, spanning 11 categories across 8 countries. The benchmark evaluates 7 multimodal retrievers and 17 VLMs, finding that culture-aware RAG yields average improvements of 6% on cVQA and 11% on cIC.
RefTool: Reference-Guided Tool Creation for Knowledge-Intensive Reasoning: This paper proposes RefTool, a framework that automatically creates executable Python tools from external reference materials (e.g., textbooks, knowledge snippets), addressing the failure of existing tool creation methods that rely on LLMs' intrinsic knowledge in specialized domains. RefTool achieves an average improvement of 12.3% over prior methods on causal reasoning, physics, and chemistry tasks.
Retrieval-Augmented Generation for Predicting Cellular Responses to Gene Perturbation: This paper proposes PT-RAG (Perturbation-aware Two-stage Retrieval-Augmented Generation), the first application of differentiable retrieval-augmented generation to single-cell gene perturbation response prediction. The framework combines semantic retrieval of candidate perturbations via GenePT embeddings with Gumbel-Softmax-based conditional discrete sampling for cell-type-aware, end-to-end retrieval optimization. PT-RAG surpasses the STATE baseline on the Replogle-Nadig dataset (Pearson 0.633 vs. 0.624), while naïve RAG severely degrades performance (Pearson of only 0.396), indicating that differentiable, cell-type-aware retrieval is indispensable in this domain.
Revela: Dense Retriever Learning via Language Modeling: This paper proposes Revela, which integrates retriever learning into language modeling via an in-batch attention mechanism. Next-token prediction (NTP) draws not only on within-sequence context but also on other sequences in the batch, weighted by retriever similarity scores, enabling training of a strong dense retriever without labeled query-document pairs.
Summaries as Centroids for Interpretable and Scalable Text Clustering: This paper proposes k-NLPmeans and k-LLMmeans, which periodically replace numeric centroids with textual summaries (summary-as-centroid) during k-means iterations, achieving interpretable cluster prototypes while preserving the standard k-means objective. The number of LLM calls is independent of dataset size.
Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding: This paper proposes Token-Guard, a token-level hallucination control method based on self-checking decoding, which detects and suppresses hallucinations during decoding via token-level/segment-level scoring in the hidden space and an iterative refinement mechanism, achieving an average F1 improvement of 16.3%.
TokMem: One-Token Procedural Memory for Large Language Models: This paper proposes TokMem, which compiles reusable task procedures into single trainable memory tokens that serve simultaneously as procedure indices and generation control signals, enabling efficient invocation of 1,000+ task procedures without long prompts and supporting continual expansion without catastrophic forgetting.
Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders: This paper proposes RAGLens, which leverages sparse autoencoders (SAEs) to disentangle RAG-hallucination-specific features from LLM internal activations, and constructs a lightweight, interpretable hallucination detector via mutual information-based feature selection combined with a Generalized Additive Model (GAM). RAGLens surpasses existing methods across multiple benchmarks and supports token-level interpretable feedback and hallucination mitigation.
Your Language Model Secretly Contains Personality Subnetworks: This paper proposes extracting persona-specific subnetworks from pretrained LLMs via activation-guided pruning, enabling efficient persona switching without any training, and introduces a contrastive pruning strategy to enhance parameter separation between opposing personas.
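
The ARC-JSD note earlier in this section describes scoring each context sentence by the Jensen-Shannon divergence between the response distribution under the full context and under the context with that sentence ablated. The sketch below illustrates only that scoring idea on toy probability vectors. Using a single distribution per condition is a simplifying assumption, the helper names (`kl`, `jsd`, `rank_sentences`) and all numbers are hypothetical, and a real system would obtain the distributions from an LLM.

```python
import math


def kl(p, q):
    """KL divergence KL(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rank_sentences(full_dist, ablated_dists):
    """Rank context sentences by how much removing each one shifts the output.

    ablated_dists maps a sentence index to the response distribution obtained
    with that sentence removed (toy inputs here, an assumption of this sketch).
    """
    scores = {idx: jsd(full_dist, dist) for idx, dist in ablated_dists.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Toy example: removing sentence 1 changes the distribution far more than
# removing sentence 0, so sentence 1 receives the higher attribution score.
full = [0.7, 0.2, 0.1]
ablated = {0: [0.69, 0.21, 0.10], 1: [0.2, 0.3, 0.5]}
ranking, scores = rank_sentences(full, ablated)
print(ranking, scores)
```

Because the divergence is computed from forward-pass outputs alone, a score of this kind needs no gradients or fine-tuning, which is consistent with the efficiency claim in the note.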

🎵 Audio & Speech

AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer: This paper proposes AC-Foley, a reference-audio-guided video-to-audio synthesis framework that achieves fine-grained timbre control, timbre transfer, and zero-shot sound effect generation via two-stage training (acoustic feature learning + temporal adaptation) and multimodal conditional flow matching, significantly outperforming existing methods in audio quality and acoustic fidelity.
AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations: This paper proposes AutoFigure — the first agent framework based on a "Reasoned Rendering" paradigm — which automatically generates publication-ready scientific illustrations from long scientific texts by decoupling structural layout planning and aesthetic rendering into two stages. It is accompanied by FigureBench, the first large-scale benchmark (3,300 pairs) for systematic evaluation, with 66.7% of generated results deemed usable in camera-ready submissions by the original authors.
Discovering and Steering Interpretable Concepts in Large Generative Music Models: This work presents the first application of Sparse Autoencoders (SAEs) to the audio/music domain, extracting interpretable musical concept features from the residual stream of the autoregressive music generation model MusicGen, and leveraging these features for controllable generation (steering).
Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation: This paper proposes Dynamic Parameter Memory (DPM), a mechanism that encodes speech information sentence-by-sentence into the parameter space of a temporary LoRA module during inference, enabling speech large language models (SLLMs) with limited context windows to process arbitrarily long conversational audio. The approach achieves state-of-the-art performance on IEMOCAP and MELD.
EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models: This paper proposes EchoMind, the first multi-level interrelated benchmark for empathetic dialogue, which systematically evaluates Speech Language Models' ability to perceive non-verbal acoustic cues and generate empathetic responses through a cognitive pipeline of Understanding → Reasoning → Conversation.
Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention: This paper proposes Dolphin, a model that maps lip movements to discrete semantic tokens via a dual-path lightweight video encoder (DP-LipCoder), and introduces a Global-Local Attention (GLA) separator. Dolphin surpasses state-of-the-art methods on three benchmarks while reducing parameters by 50%+, MACs by 2.4×, and GPU inference latency by 6×.
EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning: This work is the first to reformulate Speech Emotion Recognition (SER) as a deep reasoning problem, leveraging a prosody-enhanced backbone model combined with GRPO-PTR (Progressive Trustworthy Reasoning reward) reinforcement learning to generate explainable emotion reasoning grounded in acoustic evidence.
FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates: This paper proposes FlexiCodec, a neural audio codec built around a dynamic frame rate merging strategy guided by ASR features, achieving high-quality speech coding at ultra-low frame rates of 3–12.5 Hz while maintaining superior semantic information retention.
Improving Black-Box Generative Attacks via Generator Semantic Consistency: By analyzing semantic degradation in intermediate-layer features of perturbation generators, this paper proposes a Mean Teacher-based semantic structure-aware framework that performs self-feature distillation at early generator layers to preserve semantic consistency, thereby enhancing the transferability of adversarial examples across models, domains, and tasks.
Incentive-Aligned Multi-Source LLM Summaries: This paper introduces the Truthful Text Summarization (TTS) framework, which incorporates a multi-task peer prediction mechanism from game theory into LLM multi-source summarization pipelines. The approach constructs evaluation claim sets via leave-one-out cross-referencing, extracts each source's stance on individual claims, scores source reliability using informative agreement, filters unreliable sources, and regenerates the summary. The framework is theoretically proven to make truthful reporting a utility-maximizing strategy, and empirically demonstrates robustness against prompt injection, misinformation sources, and coordinated attacks.
Knowing When to Quit: Probabilistic Early Exits for Speech Separation: This paper proposes PRESS (Probabilistic Early-exit for Speech Separation) and the PRESS-Net architecture. By jointly modeling clean speech signals and error variance within a probabilistic framework, PRESS derives an interpretable early-exit criterion based on signal-to-noise ratio (SNR), enabling fine-grained dynamic computation scaling for speech separation networks while maintaining performance competitive with static SOTA models.
Latent Speech-Text Transformer: This paper proposes the Latent Speech-Text Transformer (LST), which aggregates discrete speech tokens into higher-level "latent speech patches" as autoregressive units (analogous to BLT's treatment of bytes), aligning the sequence modeling granularity of speech and text (reducing the speech-to-text length ratio from 20:1 to ~1:1). LST achieves +6.5% absolute improvement on Speech HellaSwag, with gains that continue to grow from 420M to 7B parameters, while reducing ASR/TTS inference computation.
LogicReward: Incentivizing LLM Reasoning via Step-Wise Logical Supervision: This paper proposes LogicReward, a reward function that employs the Isabelle theorem prover for step-wise logical correctness verification. Combined with Autoformalization with Soft Unification to reduce natural language ambiguity, the trained 8B model surpasses GPT-4o by 11.6% and o4-mini by 2% on NLI and logical reasoning tasks.
MAPSS: Manifold-Based Assessment of Perceptual Source Separation: This paper proposes two complementary metrics—Perceptual Separation (PS) and Perceptual Match (PM)—that embed self-supervised encoded representations onto a low-dimensional manifold via diffusion maps, achieving for the first time a functional decoupling of leakage and self-distortion in source separation evaluation. Compared against 18 mainstream metrics, the proposed measures rank first or second in correlation with subjective listening scores in nearly all experimental conditions.
MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark: This paper introduces MMSU (5,000 audio QA items across 47 tasks), the first benchmark to systematically incorporate linguistic theory into spoken language understanding and reasoning evaluation. Evaluating 22 SpeechLLMs, it reveals significant gaps in phonological perception and complex reasoning among existing models.
PACE: Pretrained Audio Continual Learning: This paper presents the first systematic benchmark for audio continual learning (CL), identifies a fundamental upstream–downstream mismatch in pretrained audio models caused by the dominance of low-level spectral features, and proposes PACE—comprising improved first-session adaptation, adaptive subspace-orthogonal PEFT, and boundary-aware perturbation regularization—achieving substantial improvements over SOTA across 6 audio CL benchmarks.
Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition: This paper proposes USR 2.0, which replaces autoregressive pseudo-label generation with CTC-driven teacher forcing, enabling attention pseudo-labels to be produced in a single forward pass. The approach achieves nearly 2× training speedup, enhances out-of-distribution robustness via joint CTC-attention prediction, and establishes state-of-the-art results on LRS3/LRS2/WildVSR across all three tasks (ASR/VSR/AVSR) within a single unified model.
Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering: This paper proposes QSTar, a framework that embeds query guidance throughout the entire processing pipeline and introduces a three-dimensional Spatial-Temporal-Frequency Interaction module (leveraging spectral features to distinguish timbres), achieving significant performance gains on Music Audio-Visual Question Answering (Music AVQA).
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments: This paper presents RedTeamCUA, the first red-teaming framework for computer-use agents (CUAs) in hybrid Web-OS environments, along with RTC-Bench comprising 864 test cases. The framework systematically evaluates the vulnerability of 9+ frontier CUAs to indirect prompt injection attacks, finding that all evaluated CUAs are exploitable (peak ASR of 83%). Notably, more capable models pose greater risks — the large gap between attempt rate (AR) and attack success rate (ASR) implies that improvements in model capability will directly translate into higher attack success rates.
Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion: This paper proposes a Speech-guided Machine Translation (SMT) framework that synthesizes source-language speech via TTS and jointly feeds it with text into an MLLM for translation. A self-evolution mechanism automatically selects beneficial synthetic speech samples for continual training. The approach achieves state-of-the-art performance on Multi30K, surpassing all MMT methods, and attains average SOTA across 108 translation directions on FLORES-200 with only 9B parameters.
Scaling Speech Tokenizers with Diffusion Autoencoders: This paper proposes SiTok (Speech Diffusion Tokenizer), which employs a diffusion autoencoder to jointly train the encoder–quantizer–decoder in a single stage (rather than two stages), incorporates CTC-based semantic regularization to ensure discrete tokens retain linguistic information, and scales to 1.6B parameters trained on 22 million hours of speech data. At an extremely low token rate (12.5 Hz / 200 bps), SiTok attains 3.34% WER on reconstruction and 4.95% WER on LLM-based ASR.
SiNGER: A Clearer Voice Distills Vision Transformers Further: This paper proposes SiNGER (Singular Nullspace-Guided Energy Reallocation), a framework that suppresses high-norm artifacts in ViT features by applying perturbations along the left-nullspace directions of teacher features, thereby preserving informative signals. Combined with lightweight LoRA adapters, SiNGER achieves state-of-the-art performance across multiple downstream tasks while producing cleaner and more interpretable representations.
SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables: This paper presents SPARTA, an end-to-end framework for automatically constructing large-scale table-text multi-hop QA benchmarks. By leveraging a reference fact database, provenance-based refinement, and realistic structural constraints to generate high-quality nested SQL queries, SPARTA reduces the F1 of state-of-the-art models by over 30 points.
Statistical Guarantees for Offline Domain Randomization: This paper formalizes offline domain randomization (ODR) as a maximum likelihood estimation problem over a parameterized family of simulators. Under mild regularity and identifiability assumptions, it establishes weak consistency (convergence in probability); with an additional uniform Lipschitz continuity assumption, strong consistency (almost sure convergence) is further proved. These results provide the first theoretical foundation for the empirical success of ODR in sim-to-real transfer.
Stitch: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models: Stitch enables "thinking while speaking" in spoken language models (SLMs) by interleaving silent reasoning tokens with speech tokens in chunks, exploiting idle compute during audio playback for reasoning. Stitch-S achieves first-chunk latency identical to the no-reasoning baseline while improving math reasoning accuracy by approximately 15 percentage points.
SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation: This paper proposes SyncTrack, a unified architecture comprising track-shared modules (dual cross-track attention for rhythmic synchronization) and track-specific modules (learnable instrument priors for timbre preservation), and introduces three new rhythmic consistency evaluation metrics (IRS/CBS/CBD). SyncTrack achieves substantial improvements in multi-track music generation quality (FAD: 6.55→1.26, subjective MOS: 3.42 vs. 1.57).
The Devil behind the Mask: An Emergent Safety Vulnerability of Diffusion LLMs: This paper presents the first systematic investigation of inherent safety vulnerabilities in diffusion large language models (dLLMs) arising from their bidirectional modeling and parallel decoding mechanisms. It introduces the DiJA jailbreak attack framework, which achieves near-100% attack success rates on multiple aligned dLLMs via interleaved mask-text prompts.
Toward Complex-Valued Neural Networks for Waveform Generation: This paper proposes ComVo, the first iSTFT vocoder to employ complex-valued neural networks (CVNNs) in both the generator and discriminator. It stabilizes training via a phase quantization layer and introduces a block-matrix computation scheme that reduces training time by 25%, achieving synthesis quality superior to real-valued baselines such as Vocos on LibriTTS.
TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization: This paper proposes TripleSumm, which achieves dynamic frame-level modality importance adjustment via a Multi-scale Temporal block (hierarchical sliding-window attention) and a Cross-modal Fusion block (fusion token adaptively weighting visual/text/audio). The authors also release MoSu, the first large-scale triple-modality video summarization dataset (52,678 videos), achieving SOTA on 4 benchmarks.
VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation: This paper proposes VowelPrompt, which extracts vowel-level prosodic descriptors (pitch/energy/duration) grounded in phonetic evidence, converts them into natural language to augment LLM emotion recognition prompts, and employs a two-stage SFT+GRPO training pipeline. The method consistently outperforms state-of-the-art approaches under zero-shot, fine-tuning, cross-domain, and cross-lingual conditions, while generating interpretable emotion reasoning.
When and Where to Reset Matters for Long-Term Test-Time Adaptation: This paper proposes ASR, an adaptive selective reset scheme that uses prediction concentration $\mathcal{C}_t$ to dynamically determine when to reset (avoiding the suboptimality of fixed-period resets) and employs a progressive layer selection strategy from output to input layers to determine where to reset (preserving valuable adaptation knowledge). Combined with importance-aware regularization for recovering critical knowledge in reset layers and on-the-fly adaptation adjustment, ASR achieves a 44.12% improvement over the prior SOTA on CCC-Hard.
When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment: This paper identifies that Attack Success Rates (ASR) in LLM jailbreak benchmarks are artificially inflated by semantically irrelevant style patterns (e.g., "create a list of"), a phenomenon observed in nearly all of 36 evaluated LLMs. Superficial style alignment fine-tuning further exacerbates this risk. The paper proposes SafeStyle — a defense that mitigates this risk via style-augmented safety training data.
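As an illustration of the chunked interleaving described in the Stitch entry above, the following is a minimal sketch, not the authors' implementation. The function names, the chunk sizes, and the `speech_first` flag are hypothetical; putting a speech chunk first is only our reading of the remark that Stitch-S matches the first-chunk latency of the no-reasoning baseline.

```python
from itertools import zip_longest


def split_into_chunks(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def interleave_chunks(reasoning, speech, reasoning_chunk, speech_chunk,
                      speech_first=True):
    """Alternate chunks of silent reasoning tokens and speech tokens.

    Hypothetical helper: chunk sizes and ordering are illustrative
    assumptions, not values taken from the paper.
    """
    r_chunks = split_into_chunks(reasoning, reasoning_chunk)
    s_chunks = split_into_chunks(speech, speech_chunk)
    first, second = (s_chunks, r_chunks) if speech_first else (r_chunks, s_chunks)
    sequence = []
    # zip_longest keeps emitting the longer stream once the shorter one ends.
    for a, b in zip_longest(first, second, fillvalue=[]):
        sequence.extend(a)
        sequence.extend(b)
    return sequence


reasoning_tokens = ["r1", "r2", "r3", "r4"]
speech_tokens = ["s1", "s2", "s3", "s4", "s5", "s6"]
mixed = interleave_chunks(reasoning_tokens, speech_tokens,
                          reasoning_chunk=2, speech_chunk=3)
print(mixed)
```

With `speech_first=True` the first emitted chunk is speech, so playback could begin before any reasoning tokens are produced; the reasoning chunks that follow would be generated while the previous audio chunk plays.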

🛡️ AI Safety¶

Action-Free Offline-to-Online RL via Discretised State Policies: This paper formalises the "action-free offline-to-online RL" setting for the first time and proposes the OSO-DecQN algorithm. By discretising continuous state differences into ternary tokens $\{-1, 0, 1\}$, a state policy $Q(s, \Delta s)$ is pretrained on action-free $(s, r, s')$ tuples to predict the expected direction of next-state change rather than actions. A policy-switching mechanism combined with an online-trained inverse dynamics model (IDM) then translates the state policy into executable actions, guiding online agents to accelerate learning. The approach consistently improves both convergence speed and asymptotic performance on D4RL and DeepMind Control Suite (including 78-dimensional state spaces).
Adaptive Methods Are Preferable in High Privacy Settings: An SDE Perspective: This work is the first to analyze differentially private optimizers through a stochastic differential equation (SDE) framework, revealing fundamental behavioral differences between DP-SGD and DP-SignSGD under privacy noise: adaptive methods achieve a superior privacy-utility tradeoff of $\mathcal{O}(1/\varepsilon)$ vs. $\mathcal{O}(1/\varepsilon^2)$ in high-privacy regimes, and their hyperparameters transfer across privacy budgets.
ATEX-CF: Attack-Informed Counterfactual Explanations for Graph Neural Networks: This paper proposes ATEX-CF, a framework that, for the first time, unifies the edge-addition strategy from adversarial attacks with the edge-deletion strategy from counterfactual explanations. Through joint optimization of prediction flipping, sparsity, and plausibility, ATEX-CF generates more faithful, concise, and plausible instance-level counterfactual explanations for GNNs.
Back to Square Roots: An Optimal Bound on the Matrix Factorization Error for Multi-Epoch Differentially Private SGD: This paper proposes the Banded Inverse Square Root (BISR) matrix factorization method, which imposes a banded structure on the inverse correlation matrix (rather than on the correlation matrix itself). This approach achieves, for the first time, an asymptotically optimal factorization error bound for multi-epoch differentially private SGD, and is accompanied by a memory-efficient variant, BandInvMF.
Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning: This paper proposes Daze, a backdoor attack in which a malicious simulator developer—without any access to or modification of the agent's reward function—plants a backdoor solely by manipulating state transitions: when the agent fails to execute the target action in a trigger state, it is forced to take random actions ("dazed"), thereby theoretically guaranteeing both attack success and stealthiness. The work also presents the first demonstration of an RL backdoor attack on real robot hardware.
Beyond Match Maximization and Fairness: Retention-Optimized Two-Sided Matching: This paper proposes Matching for Retention (MRet), an algorithm that, for the first time, shifts the optimization objective of two-sided matching platforms from "maximizing the number of matches" or "satisfying fairness constraints" to "directly maximizing user retention rate." By learning personalized retention curves and exploiting the concavity of the retention function, the otherwise NP-hard joint retention-gain optimization for both sides is reduced to an $O(N \log N)$ sorting problem. MRet achieves significant retention improvements on both synthetic data and real-world data from a large Japanese dating platform.
Bridging Fairness and Explainability: Can Input-Based Explanations Promote Fairness in Hate Speech Detection?: The first systematic large-scale quantitative study on the relationship between input-based explanations and fairness: explanations can effectively detect biased predictions and serve as training regularizers to reduce bias, but cannot be reliably used for automatic fair model selection.
Co-LoRA: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients: This paper proposes FedMosaic, a framework addressing dual heterogeneity in personalized federated learning (PFL): RELA measures task relevance via gradient similarity to enable customized aggregation (addressing data heterogeneity), while Co-LoRA enables cross-architecture knowledge sharing (e.g., Llama vs. Qwen) through dimension-invariant modules $P \in \mathbb{R}^{r \times r}, Q \in \mathbb{R}^r$ (addressing model heterogeneity). The framework achieves substantial improvements over SOTA on DRAKE, a newly proposed 40-task multimodal PFL benchmark.
Dataless Weight Disentanglement in Task Arithmetic via Kronecker-Factored Approximate Curvature: This work elegantly bridges classical curvature approximation theory (KFAC) with the practical demands of task arithmetic, proposing a data-free weight disentanglement regularization method. The theoretical derivation is clear, with a coherent logical chain from representation drift regularization → Jacobian Gramian → GGN → KFAC. Experiments span multiple model scales across both vision and language domains, and the robustness analysis with respect to the $\alpha$ hyperparameter is practically valuable. Limitations include the $O(d^2)$ storage overhead of KFAC for large models and a remaining gap relative to data-dependent methods in the text domain.
Efficient Resource-Constrained Training of Transformers via Subspace Optimization: This paper proposes WASI (Weight-Activation Subspace Iteration), which leverages the observation that parameter subspaces remain stable during fine-tuning to simultaneously compress both the weights (via SVD + Gram-Schmidt subspace iteration) and activations (via Tucker decomposition) of Transformers. Both training and inference are performed entirely within low-rank representations, achieving 62× training memory compression and 1.4× speedup on Raspberry Pi 5 with negligible accuracy loss.
Extending Sequence Length is Not All You Need: Effective Integration of Multimodal Signals for Gene Expression Prediction: This paper challenges the prevailing "longer is better" paradigm in gene expression prediction, demonstrating that current SSM models fundamentally rely only on proximal information. It further identifies background chromatin signals (DNase-seq/Hi-C) as confounding variables that introduce spurious correlations, and proposes the Prism framework, which applies backdoor adjustment for deconfounding—achieving state-of-the-art performance with only 2k-length sequences, surpassing methods that use 200k-length sequences.
Hide and Find: A Distributed Adversarial Attack on Federated Graph Learning: This paper proposes FedShift, a two-stage "hide-and-find" distributed adversarial attack framework. In the first stage, covert shifters are injected into training graphs via gentle distributional shifts. In the second stage, the trained shifter generator serves as a warm initialization for efficiently searching adversarial perturbations, which are then aggregated across multiple malicious clients to form the final adversarial examples. FedShift achieves state-of-the-art attack success rates on six large-scale datasets, evades three mainstream defense algorithms, and improves convergence speed by over 90%.
Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights: This paper reveals that privacy vulnerability is concentrated in a remarkably small fraction of weights (as few as 0.1%), and that this vulnerability is highly entangled with learnability (Pearson $r > 0.9$). The proposed CWRF method achieves superior privacy-utility trade-offs by rewinding privacy-vulnerable weights to their initialization and freezing them, while fine-tuning only the remaining weights.
Less is More: Towards Simple Graph Contrastive Learning: This paper revisits the foundational principles of graph contrastive learning (GCL) and identifies that node feature noise can be mitigated through structural feature aggregation derived from graph topology. Based on this insight, the authors propose a minimalist GCL model that contrasts a GCN encoder (capturing structural features) against an MLP encoder (isolating node feature noise), requiring neither data augmentation nor negative sampling. The method achieves state-of-the-art performance on heterophilic graph benchmarks while offering advantages in complexity, scalability, and robustness on homophilic graphs.
Risk-Sensitive Agent Compositions: This paper formalizes agent workflows as directed acyclic graphs (Agent Graphs), models safety/fairness/privacy requirements via a max loss function, and proposes the BucketedVaR algorithm, which combines union bounds with dynamic programming to find the optimal agent composition minimizing VaR/CVaR in polynomial time. The approach is proven to be asymptotically near-optimal under an independence assumption on agent losses.
Robust Spiking Neural Networks Against Adversarial Attacks: This paper theoretically demonstrates that threshold-proximal spiking neurons are the key robustness bottleneck in directly trained SNNs — they simultaneously set the theoretical upper bound on adversarial attack strength and are most susceptible to state flipping. The proposed Threshold Guarding Optimization (TGO) method addresses this through a dual strategy of membrane potential constraint and noisy LIF neurons, achieving state-of-the-art robustness across multiple adversarial attack scenarios with zero additional inference overhead.
Membership Privacy Risks of Sharpness Aware Minimization: This paper presents the first systematic study demonstrating that models trained with SAM (Sharpness-Aware Minimization), despite achieving better generalization, are more vulnerable to membership inference attacks (MIA) than SGD-trained models. Two complementary explanations are provided through theoretical analysis and experiments: memorization behavior and variance contraction.
Sample-Efficient Distributionally Robust Multi-Agent Reinforcement Learning via Online Interaction: This paper presents the first study of online learning in Distributionally Robust Markov Games (DRMGs), proposing the MORNAVI algorithm. Without relying on a simulator or offline data, MORNAVI efficiently learns optimal robust policies through online interaction, and provides the first provable regret bounds under both TV-divergence and KL-divergence uncertainty sets.
Skirting Additive Error Barriers for Private Turnstile Streams: This paper demonstrates that in the differentially private turnstile streaming model, allowing multiplicative error circumvents known polynomial additive error lower bounds, reducing the additive error for distinct elements and $F_2$ moment estimation from polynomial to $\mathrm{polylog}(T)$.
Skirting Additive Error Barriers for Private Turnstile Streams: This paper proves that the polynomial pure additive error lower bounds in differentially private turnstile streams—$\Omega(T^{1/4})$ for distinct elements counting and $\Omega(T)$ for $F_2$ moment estimation—can be circumvented by introducing multiplicative error. The paper achieves $(\text{polylog}(T), \text{polylog}(T))$ mixed error for distinct elements and $(1+\eta, \text{polylog}(T))$ mixed error for $F_2$ moments, both requiring only polylogarithmic space.
Time Is All It Takes: Spike-Retiming Attacks on Event-Driven Spiking Neural Networks: This paper proposes the Spike-Retiming Attack — a temporal attack that perturbs only spike timestamps without adding or removing spikes. It formalizes a unified tri-norm budget ($\mathcal{B}_\infty$ local jitter / $\mathcal{B}_1$ total delay / $\mathcal{B}_0$ tamper count) under a capacity-1 constraint, and employs Projected-in-the-Loop (PIL) optimization to decouple strict forward projection from soft backward differentiation. The method achieves >90% ASR with <2% spike perturbation on CIFAR10-DVS/DVS-Gesture/N-MNIST, revealing a critical temporal vulnerability in event-driven SNNs.
Toward Enhancing Representation Learning in Federated Multi-Task Settings: This paper proposes the Muscle loss — an N-tuple-level multi-model contrastive learning objective whose minimization is equivalent to maximizing a lower bound on the mutual information among all model representations. Building on this, the FedMuscle algorithm aligns the representation spaces of heterogeneous models via a public dataset, naturally handling both model and task heterogeneity. FedMuscle consistently outperforms state-of-the-art baselines across CV/NLP multi-task settings, with gains of up to +28.65%.
Traceable Black-box Watermarks for Federated Learning: This paper proposes TraMark, which partitions the model parameter space into a main-task region and a watermark region and employs masked aggregation to prevent watermark collision. TraMark achieves server-side traceable black-box watermark injection in federated learning for the first time, attaining a verification rate of 99.58% with only a 0.54% drop in main-task accuracy.
Unified Privacy Guarantees for Decentralized Learning via Matrix Factorization: This paper unifies diverse algorithms and trust models in decentralized learning (DL) under a matrix factorization (MF) mechanism framework, extends privacy guarantees to more general matrix types, and proposes the MAFALDA-SGD algorithm that significantly outperforms existing methods on both synthetic and real-world graph topologies by optimizing noise correlation.
VPI-Bench: Visual Prompt Injection Attacks for Computer-Use Agents: This paper introduces VPI-Bench, the first comprehensive visual prompt injection attack benchmark (306 samples), systematically evaluating the security of Computer-Use and Browser-Use Agents across 5 platforms. Results reveal that Browser-Use Agents are critically vulnerable (100% AR on Amazon/Booking), that even Anthropic's CUA exhibits severe vulnerabilities (up to 59% AR), and that system prompt defenses are ineffective.
Watermark-based Detection and Attribution of AI-Generated Content: This paper presents the first systematic study on watermark-based user-level detection and attribution of AI-generated content. It provides theoretical analysis (bounds on TDR/FDR/TAR), an efficient watermark selection algorithm (A-BSTA), and cross-modal (image + text) experimental validation, demonstrating that detection and attribution inherit the accuracy and (non-)robustness of the underlying watermarking method.
Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information: This paper provides a unified explanation for the effectiveness of all unlearnable example (UE) methods through the lens of mutual information (MI) reduction, and proves that minimizing the intra-class covariance of poisoned features reduces the MI upper bound. Based on this framework, MI-UE is proposed, which achieves covariance reduction via intra-class cosine similarity maximization, suppressing test accuracy to 9.95% on CIFAR-10 (near random-chance), while significantly outperforming existing methods under adversarial training defenses.
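To make the ternary discretisation from the Action-Free Offline-to-Online RL entry above concrete, here is a minimal sketch under stated assumptions. The function name and the dead-zone threshold `eps` are hypothetical; the entry only says that continuous state differences are mapped to tokens in $\{-1, 0, 1\}$, so the exact thresholding rule used by OSO-DecQN may differ.

```python
def ternarize_state_delta(state, next_state, eps=1e-3):
    """Map each component of (next_state - state) to -1, 0, or +1.

    `eps` is an assumed dead zone: changes smaller in magnitude than
    `eps` are treated as "no change" (token 0).
    """
    if len(state) != len(next_state):
        raise ValueError("state and next_state must have the same length")
    tokens = []
    for s, s_next in zip(state, next_state):
        delta = s_next - s
        if delta > eps:
            tokens.append(1)       # component increased
        elif delta < -eps:
            tokens.append(-1)      # component decreased
        else:
            tokens.append(0)       # component roughly unchanged
    return tokens


tokens = ternarize_state_delta([0.0, 1.0, 2.0], [0.5, 1.0, 1.0])
print(tokens)
```

Each state dimension gets its own token, so a state policy trained on such targets would only need $(s, r, s')$ tuples and no action labels.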

📚 Pretraining¶

A Law of Data Reconstruction for Random Features (and Beyond): This paper establishes a data reconstruction law in random feature models from information-theoretic and algebraic perspectives: when the parameter count $p \gg dn$ (where $d$ is the data dimension and $n$ is the number of samples), training data can be fully reconstructed. A projection-loss-based optimization method is proposed and the universality of this threshold is validated on RF models, two-layer networks, and ResNets.
Block-Sample MAC-Bayes Generalization Bounds: This paper proposes block-sample MAC-Bayes (mean approximately correct) generalization bounds that partition the training data into $J$ blocks and replace the monolithic KL divergence with a sum of per-block conditional KL divergences. The resulting bounds remain finite and meaningful even in settings where the original PAC-Bayes bounds are vacuous (e.g., deterministic learning algorithms such as mean estimation). The paper further establishes that a high-probability (PAC) version of these bounds is generally unattainable.
CHAMMI-75: Pre-training multi-channel models with heterogeneous microscopy images: This work introduces CHAMMI-75—the largest heterogeneous multi-channel microscopy image pre-training dataset (2.8M images, 75 sources, 25 channel types, 16 species)—and demonstrates that imaging modality diversity is the key factor for improving generalization of multi-channel models. The trained MorphEm model achieves state-of-the-art performance on 6 out of 7 benchmarks.
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training: This work constructs Common Corpus — the largest legally licensed LLM pre-training dataset at approximately 2 trillion tokens — spanning 6 major collections (government, culture, science, code, web, and semantic), covering multiple languages including low-resource ones. All data originates from copyright-free or permissively licensed sources, accompanied by complete data provenance and a multi-stage filtering pipeline. The dataset has been adopted by industry leaders including Anthropic.
Deconstructing Positional Information: From Attention Logits to Training Biases: This paper proposes a unified analytical framework based on Toeplitz matrices, categorizing positional encodings into additive (Absolute/T5/ALiBi) and multiplicative (RoPE) types. Through synthetic tasks, it reveals that RoPE exhibits significant advantages on position-sensitive tasks but suffers from a "single-head deposit pattern" in shallow layers, where nearly all positional reasoning concentrates in a single attention head. The paper further provides a theoretical proof that this pattern is an intrinsic property of RoPE's multiplicative structure.
Emergent Misalignment is Easy, Narrow Misalignment is Hard: Fine-tuning on narrow-domain harmful data induces broad misalignment (emergent misalignment) because "general misalignment" constitutes a simpler and more efficient solution in parameter space than "misalignment confined to a specific domain"—the general solution exhibits smaller parameter norm and greater robustness to perturbations.
Explaining Grokking and Information Bottleneck through Neural Collapse Emergence: This work provides a unified explanation of two prominent late-stage training phenomena—Grokking (delayed generalization) and the Information Bottleneck compression phase—through the lens of Neural Collapse. It proves that the contraction of population within-class variance is the common key factor underlying both phenomena, and reveals that training loss convergence and the onset of Neural Collapse operate on distinct timescales governed by weight decay.
FictionalQA: A Dataset for Studying Memorization and Knowledge Acquisition: This work introduces the FictionalQA dataset and generation pipeline, which synthesizes webtext-style documents and QA pairs about fictional events to study both factual memorization and verbatim memorization in LLM training under controlled conditions. Key findings show that greater surface-form diversity facilitates knowledge acquisition, while concise structured lists are least conducive to generalization.
Identifying and Evaluating Inactive Heads in Pretrained LLMs: This paper systematically evaluates 12 scoring functions for identifying inactive attention heads in LLMs, finding that the attention head output norm-based scoring function (AHON LN) more consistently identifies inactive heads across model families than traditional attention weight metrics. On average across 14 models, over 12% of heads can be zeroed out while maintaining MMLU accuracy within 1%.
Imagine How To Change: Explicit Procedure Modeling for Change Captioning: ProCap reframes change captioning from static image-pair comparison to dynamic procedure modeling. In the first stage, a procedure encoder is trained via frame interpolation and masked reconstruction to capture spatiotemporal change dynamics; in the second stage, learnable process queries implicitly infer the change procedure. ProCap surpasses state-of-the-art methods on three benchmarks.
Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rank: By analyzing the gradient flow dynamics of deep matrix factorization (deep linear networks) in matrix completion, this paper proves that coupled dynamics is the key mechanism underlying the low-rank implicit bias of deep networks, and that networks of depth $\geq 3$ inevitably exhibit coupling except under diagonal initialization. This provides a theoretical explanation for why deep models are able to avoid loss of plasticity.
Intrinsic Training Dynamics of Deep Neural Networks: This paper investigates when the trajectory in parameter space under gradient flow training of deep neural networks can be "lifted" to a low-dimensional intrinsic space and expressed as an intrinsic Riemannian gradient flow. It proposes an intrinsic recoverability criterion based on conservation laws and extends the results to ReLU networks and linear networks of arbitrary depth.
Lossless Vocabulary Reduction for Auto-Regressive Language Models: This paper proposes a theoretical framework for Lossless Vocabulary Reduction (LVR), which converts any auto-regressive language model into an exactly equivalent model operating over an arbitrary sub-vocabulary via nested tokenization. Building on the Maximal Common Vocabulary (MCV), the framework enables efficient ensembling of language models with heterogeneous tokenization schemes, with effectiveness validated on GSM8K, MATH, translation, and other tasks.
MoMa: A Simple Modular Deep Learning Framework for Material Property Prediction: MoMa is a modular material property prediction framework that trains task-specific modules across multiple tasks and stores them centrally in a MoMa Hub, then applies a training-free Adaptive Module Composition (AMC) algorithm driven by representation similarity to assemble customized models for downstream tasks, achieving an average improvement of 14% over the strongest baseline across 17 datasets.
Polynomial, trigonometric, and tropical activations: This paper systematically explores learnable activation function families based on orthogonal bases (Hermite polynomials, Fourier trigonometric basis) and tropicalization, addressing the gradient explosion/vanishing problem of polynomial activations via variance-preserving initialization, and successfully replacing GELU in GPT-2 and ConvNeXt to enable stable training.
Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning: This paper proposes the Warmup-Stable-Only (WSO) learning rate schedule—completely eliminating the decay phase during pre-training. Despite yielding worse pre-training metrics, WSO consistently outperforms all decay-based schedules after SFT. Loss landscape analysis reveals that WSO's advantage stems from maintaining flatter minima.
Predicting Training Re-evaluation Curves Enables Effective Data Curriculums: This paper proposes the Training Re-evaluation Curve (TREC) as a diagnostic tool that analyzes the loss of a fully trained model evaluated on training data at each timestep, thereby guiding optimal placement of high-quality data. The paper further demonstrates that the shape of TREC can be predicted via the implicit EMA coefficient of AdamW, enabling curriculum design without any actual training runs.
RECON: Robust symmetry discovery via Explicit Canonical Orientation Normalization: This paper proposes RECON, a class- and pose-agnostic canonical orientation normalization method that corrects arbitrary canonical representations produced during training via a simple right translation, enabling unsupervised instance-level symmetry discovery, OOD pose detection, and a plug-and-play test-time canonicalization layer.
Reducing Class-Wise Performance Disparity via Margin Regularization: This paper proposes MR2 (Margin Regularization for performance disparity Reduction), which dynamically adjusts class-dependent margins in both the logit and representation spaces. Grounded in theoretically derived generalization bounds, MR2 reduces class-wise performance disparity while simultaneously improving overall accuracy.
SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook: This paper proposes SemHiTok — a tokenizer that unifies visual understanding and generation via a Semantic-Guided Hierarchical Codebook (SGHC): pixel sub-codebooks are constructed on top of a pretrained semantic codebook, with structure and training fully decoupled (stage-wise optimization) to avoid the semantic–pixel conflict in joint training. Under the LLaVA setting, SemHiTok achieves state-of-the-art performance in both understanding and reconstruction among discrete tokenizers.
Steering Language Models with Weight Arithmetic: This paper proposes Contrastive Weight Steering, which extracts behavioral direction vectors from the weight difference between models fine-tuned on positive and negative behavioral data, and directly modifies model weights to control behavior. The method demonstrates superior generalization and consistency compared to Activation Steering across experiments on sycophancy, malicious behavior, and refusal.
Stochastic Self-Organization in Multi-Agent Systems: This paper proposes SelfOrg, a framework that dynamically constructs directed acyclic communication graphs (DAGs) based on semantic similarity of agent responses and Shapley value contribution estimates, enabling self-organized collaboration in multi-agent systems. The approach is particularly effective in weak-model settings.
TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling: This paper proposes TASTE (Text-Aligned Speech Tokenization and Embedding), which aligns speech tokens with text transcriptions via a cross-attention mechanism, enabling high-quality speech reconstruction at an extremely low bitrate (~150 bps). This design makes text-speech joint modeling straightforward and efficient; the resulting 1.3B-parameter TASLM outperforms 7B pretrained SLMs.
Token-level Data Selection for Safe LLM Fine-tuning: This paper proposes TOSS (Token-level data Selection for Safe LLM fine-tuning), the first token-level data selection framework that evaluates the safety risk of each token via the loss difference between a safety-degraded reference model and a utility-oriented reference model, achieving a superior safety-utility tradeoff compared to sample-level methods.
Understanding and Improving Shampoo and SOAP via Kullback-Leibler Minimization: This paper reinterprets the structured second-order moment estimation of Shampoo and SOAP through the lens of KL divergence minimization, reveals their inherent limitations, and proposes two practical methods—KL-Shampoo and KL-SOAP—that match or surpass the original methods without requiring Adam grafting.
Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors: This paper explains, from the perspective of gradient signals, why Transformers trained with next-token prediction (NTP) learn features that appear "useless" for predicting the immediate next token. It proposes a decomposition of gradient pathways into three components — direct learning, pre-caching, and circuit sharing — and validates this framework on toy tasks, OthelloGPT, and language models.
Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors: By decomposing training gradient signals into three components — direct, pre-cached, and circuit sharing — this paper explains why Transformers trained with NTP learn features that appear "useless" for predicting the current next token. The framework is validated on OthelloGPT, small language models, and a pre-trained LLM (Gemma 2).
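
The weight-arithmetic idea in the Contrastive Weight Steering entry above can be illustrated with a minimal sketch. It assumes each model is a plain dictionary from parameter name to a flat list of floats; the names `behavior_direction`, `steer`, and the scale `alpha` are hypothetical and not taken from the paper, which may normalize or select layers differently.

```python
from typing import Dict, List

# Hypothetical toy representation: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def behavior_direction(pos: Weights, neg: Weights) -> Weights:
    """Per-parameter difference between the positively and negatively fine-tuned models."""
    return {name: [p - n for p, n in zip(pos[name], neg[name])] for name in pos}


def steer(base: Weights, direction: Weights, alpha: float) -> Weights:
    """Add alpha * direction to the base weights; a negative alpha suppresses the behavior."""
    return {
        name: [w + alpha * d for w, d in zip(base[name], direction[name])]
        for name in base
    }


# Toy numbers chosen for illustration only.
base = {"layer.weight": [0.5, -1.0, 2.0]}
pos = {"layer.weight": [0.75, -1.0, 2.5]}   # fine-tuned on positive behavioral data
neg = {"layer.weight": [0.25, -1.0, 1.5]}   # fine-tuned on negative behavioral data

direction = behavior_direction(pos, neg)
amplified = steer(base, direction, alpha=2.0)
suppressed = steer(base, direction, alpha=-1.0)
```

Because the edit is applied to the weights once, no hook or activation patching is needed at inference time, which is the contrast with Activation Steering that the entry draws.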

📹 Video Understanding¶

AdAEM: An Adaptively and Automated Extensible Measurement of LLMs' Value Difference: This paper proposes AdAEM, an adaptive and self-extensible evaluation framework for LLM values. By leveraging information-theoretic optimization, AdAEM automatically generates test questions that maximally reveal value differences across LLMs, addressing the "insufficient informativeness" limitation of existing static benchmarks that fail to distinguish models' value orientations.
A.I.R.: Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering: This paper proposes A.I.R., a training-free frame selection framework built on adaptive, iterative, reasoning-driven sampling. It addresses two fundamental challenges in VideoQA (inaccurate similarity estimation by lightweight models such as CLIP, and the prohibitive computational cost of VLM-based analysis) via a two-stage strategy: GMM-based adaptive initial sampling followed by iterative VLM-guided refinement. In the worst case, A.I.R. analyzes only 72 frames (vs. 128 for baselines), while consistently improving performance across multiple long-video benchmarks.
AnveshanaAI: A Multimodal Platform for Adaptive AI/ML Education through Automated Question Generation and Interactive Assessment: This paper presents AnveshanaAI, an adaptive AI/ML education platform grounded in Bloom's cognitive taxonomy. The system employs fine-tuned GPT-2 for automated question generation, semantic similarity-based deduplication, explainable AI (XAI) techniques for transparency, and gamification mechanisms (points/badges/leaderboards) to deliver a personalized learning assessment system spanning seven domains from data science to multimodal AI. Experiments demonstrate a significant reduction in perplexity after fine-tuning and a notable improvement in learner engagement.
Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss: This paper proposes Expert-Router Coupling (ERC) Loss, a lightweight auxiliary loss function that treats router parameter rows as proxy tokens for cluster centroids and constrains expert activation norms with respect to them, achieving tight coupling between router decisions and expert capabilities. The method requires only $n^2$ activation computations and yields significant performance gains in MoE-LLMs.
Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading: This paper introduces a novel task of decoding open-ended information seeking goals from eye movement trajectories during reading. Built upon the OneStop eye-tracking dataset (360 participants, 486 questions, 162 passages), the authors develop both discriminative and generative multimodal models. RoBERTEye-Fixations achieves 49.3% accuracy on three-way goal selection (random baseline: 33%) and 70.9% on different-critical-span conditions; DalEye-Llama/GPT also significantly outperforms eye-movement-free baselines on goal reconstruction.
Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought: This paper theoretically analyzes the training dynamics of a two-layer Transformer trained with continuous Chain-of-Thought (Coconut) on the directed graph reachability problem, revealing how a "superposition" mechanism naturally emerges: the index-matching logit first grows and then remains bounded, thereby achieving a balance between exploration and exploitation.
FlashVID: Efficient Video Large Language Models via Training-free Tree-Based Spatiotemporal Token Merging: This paper proposes FlashVID, a training-free inference acceleration framework for video large language models (VLLMs) that jointly models spatial and temporal redundancy via Tree-based Spatiotemporal Token Merging (TSTM). Retaining only 10% of visual tokens, FlashVID preserves 99.1% of LLaVA-OneVision's performance and enables a 10× increase in input frames for Qwen2.5-VL.
FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding: This paper proposes FLoC, a visual token compression framework based on the facility location function. Through submodular optimization, FLoC efficiently selects a token subset that is both representative and diverse under a given budget, enabling training-free, model-agnostic, and query-agnostic token compression for long video understanding.
From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning: This paper identifies a vicious cycle between the encoder (producing sharp but noisy attention maps) and the decoder (producing spatially consistent but blurry reconstruction masks) in slot-based object-centric learning. It proposes a synergistic contrastive learning objective paired with a slot regularization warm-up strategy to convert this vicious cycle into a virtuous one, achieving substantial improvements in object discovery performance on MOVi and YouTube-VIS.
Let's Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding: This paper introduces a new task called Category Splitting, which exploits latent compositional structure embedded in video classifier weights to decompose coarse-grained action categories into fine-grained subcategories under zero-shot conditions, without retraining or additional data.
Log Probability Tracking of LLM APIs: This paper proposes Logprob Tracking (LT), a method that detects subtle changes in LLM APIs (e.g., single-step fine-tuning) using only the log probabilities of a single-token input and single-token output. LT achieves sensitivity 2–3 orders of magnitude higher than existing methods at 1000× lower cost.
LUMINA: Detecting Hallucinations in RAG System with Context-Knowledge Signals: This paper proposes LUMINA, a framework for detecting hallucinations in RAG systems via "context-knowledge signals": MMD is used to measure external context utilization, while cross-layer token prediction evolution measures internal knowledge utilization, enabling hyperparameter-free generalization.
Mamba-3: Improved Sequence Modeling using State Space Principles: This paper proposes three core improvements from an SSM perspective: exponential-trapezoidal discretization, complex-valued state spaces, and a multi-input multi-output (MIMO) formulation. These advances significantly improve model quality and state-tracking capability without increasing decoding latency, pushing the performance–efficiency Pareto frontier forward.
Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs: This work presents the first systematic reverse-engineering of temporal reasoning in VideoLLMs using mechanistic interpretability tools (Attention Knockout + Logit Lens), uncovering a three-stage information flow blueprint—"early-to-mid-layer cross-frame interaction → mid-layer video-language integration → mid-to-late-layer answer generation"—and demonstrating that retaining only 42% of attention edges preserves VideoQA performance with negligible degradation.
NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks: This paper proposes NerVE, a lightweight eigenspectrum analysis framework built on four complementary metrics (Spectral Entropy, Participation Ratio, Eigenvalue Early Enrichment, and JS Divergence). It systematically reveals how FFN nonlinearities in LLMs re-inject variance and reshape the eigenspectrum, and how architectural and optimizer choices imprint distinct spectral signatures.
Online Time Series Prediction Using Feature Adjustment: This paper proposes ADAPT-Z (Automatic Delta Adjustment via Persistent Tracking in Z-space), which shifts the adaptation objective in online time series forecasting from model parameter updates to feature space correction. A lightweight adapter fuses current features with historical gradients to address delayed feedback in multi-step forecasting. ADAPT-Z consistently outperforms existing online learning methods across 13 datasets.
Paper Copilot: Tracking the Evolution of Peer Review in AI Conferences: Paper Copilot is constructed as a persistent digital archive and analysis platform for peer reviews spanning dozens of AI/ML venues. It adopts a tri-source hybrid data collection strategy—OpenReview API, web scraping, and community contributions—to archive real-time score snapshots capturing pre- and post-rebuttal dynamics. The platform reveals a structural anomaly in ICLR 2025: a counterintuitive decline in decision entropy, signaling a shift from probabilistic tiering to near-deterministic score-driven decision-making. LLM-driven author–affiliation metadata extraction further supports talent trajectory tracking.
Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning: This paper proposes CAPO (Curvature-Aware Policy Optimization), which models second-order optimization geometry at the LM head's final layer to predict and filter token updates that would cause policy collapse. Under aggressive hyperparameters (5× learning rate, 1/12 batch size), CAPO maintains training stability and achieves a 30× sample efficiency improvement over standard GRPO on MATH.
Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs: This paper proposes TRACE-RPS, a unified defense framework against attribute inference attacks in LLMs: TRACE leverages attention mechanisms and reasoning chains to precisely locate privacy-leaking text elements for fine-grained anonymization, while RPS employs lightweight suffix optimization to induce model refusal of inference, reducing attribute inference accuracy from ~50% to below 5%.
The Expressive Limits of Diagonal SSMs for State-Tracking: This paper provides a complete characterization of the expressive power of input-dependent complex diagonal (DCD) SSMs on group state-tracking tasks: a single layer cannot track any non-abelian group, and $k$ layers can track group $G$ if and only if $G$ admits a subnormal series of length $k$ with abelian factors — precisely defining the strict expressiveness gains conferred by depth. Experiments further reveal a significant gap between expressive capacity and learnability.
FuncBenchGen: A Contamination-Free Controllable Evaluation Framework for Reliable Benchmarking: This paper proposes FuncBenchGen, a framework that models multi-step function calling as a DAG traversal problem, enabling contamination-free and finely controllable evaluation of LLM tool-use capabilities. The framework further reveals critical failure modes of reasoning models under long call chains and connected irrelevant functions.
Video-KTR: Enhancing Video Reasoning via Key Token Attribution: This paper proposes Video-KTR, a modality-aware policy shaping framework that identifies three categories of key tokens—visual-aware, temporal-aware, and entropy-aware—via counterfactual analysis, and applies selective reinforcement learning updates exclusively to these tokens, achieving state-of-the-art performance across multiple video reasoning benchmarks (Video-Holmes 42.7%, surpassing GPT-4o).
VideoNSA: Native Sparse Attention Scales Video Understanding: This paper proposes VideoNSA, which introduces Native Sparse Attention (NSA) into video-language models. Through a mixed sparse attention mechanism combining compression, selection, and sliding window branches with dynamic gating, VideoNSA achieves 128K-token video understanding using only 3.6% of the attention budget, surpassing token compression and training-free sparse attention baselines on long video understanding, temporal reasoning, and spatial understanding tasks.
Robustness and Radioactivity of Watermarks in Federated Learning May Be at Odds: This work presents the first study on LLM watermark-based data provenance in federated learning (FL). It demonstrates that watermarks are radioactive (detectable) in FL, yet a malicious server can suppress watermark signals by employing strong robust aggregation algorithms to filter watermarked updates, revealing a fundamental trilemma among radioactivity, robustness, and model utility.
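
The facility-location selection described in the FLoC entry above can be sketched with the standard greedy algorithm for monotone submodular maximization under a cardinality budget. This is an illustration under assumptions, not the paper's implementation: the similarity is taken to be clipped cosine similarity, tokens are plain lists of floats, and the function names are hypothetical.

```python
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def greedy_facility_location(tokens: List[Sequence[float]], budget: int) -> List[int]:
    """Greedily pick up to `budget` indices maximizing f(S) = sum_i max_{j in S} sim(i, j)."""
    n = len(tokens)
    # Assumption: negative similarities are clipped to zero so f is monotone.
    sim = [[max(0.0, cosine(tokens[i], tokens[j])) for j in range(n)] for i in range(n)]
    coverage = [0.0] * n  # best similarity of each token to the selected set so far
    selected: List[int] = []
    for _ in range(min(budget, n)):
        best_j, best_gain = -1, -1.0
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much candidate j improves every token's coverage.
            gain = sum(max(sim[i][j] - coverage[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        coverage = [max(coverage[i], sim[i][best_j]) for i in range(n)]
    return selected


# Two tight clusters of toy "tokens"; a budget of 2 should keep one from each.
toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
kept = greedy_facility_location(toy_tokens, budget=2)
```

The objective rewards representativeness (every token should be close to some kept token), and the diminishing marginal gain discourages picking near-duplicates, which matches the "representative and diverse" description in the entry.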

💻 Code Intelligence¶

A Problem-Oriented Perspective and Anchor Verification for Code Optimization: This paper proposes a problem-oriented (rather than user-oriented) approach to constructing optimization pairs that integrates the strategic diversity of multiple programmers, and designs an anchor verification framework that leverages "slow but correct code" to generate test cases for mitigating the "optimization tax" (correctness loss), improving the optimization rate from 31.24% to 71.06% and the speedup ratio from 2.95x to 6.08x.
Ambig-SWE: Interactive Agents to Overcome Underspecificity in Software Engineering: This paper introduces Ambig-SWE, an underspecified variant of SWE-Bench Verified, and systematically evaluates LLM coding agents across three dimensions of interactive capability—detecting underspecification, formulating clarification questions, and leveraging obtained information. Results show that interaction can improve resolution rates in underspecified settings by up to 74%, yet models default to non-interactive behavior and struggle to distinguish between well-specified and underspecified instructions.
Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation: To address the SFT performance plateau in chart-to-code generation, this paper proposes Multimodal Structured Reinforcement Learning (MSRL), which employs a dual-layer textual and visual reward function along with a two-stage RL strategy, achieving improvements of 6.2% and 9.9% on high-level metrics on ChartMimic and ReachQA respectively, establishing open-source SOTA and matching GPT-4o.
CARD: Towards Conditional Design of Multi-agent Topological Structures: This paper proposes CARD, a Conditional Agentic Graph Designer framework that adaptively designs multi-agent communication topologies based on dynamic environment signals (including model capability changes, tool availability, and knowledge source updates) via a conditional variational graph encoder and environment-aware optimization. The approach consistently outperforms static and prompt-based baselines on HumanEval, MATH, and MMLU.
DiaBlo: Diagonal Blocks Are Sufficient For Finetuning: This paper proposes DiaBlo—a parameter-efficient fine-tuning method that replaces low-rank decomposition with diagonal block updates. The weight matrix is partitioned into $N \times N$ blocks, and only the diagonal blocks $\mathbf{D}_1, \ldots, \mathbf{D}_N$ are trained. This approach entirely bypasses the non-convex optimization, initialization sensitivity, and gradient instability introduced by the $\mathbf{AB}$ product in LoRA. Zero initialization suffices for convergence, and the method requires only a single torch.einsum batched matmul in PyTorch. Theoretical analysis proves that DiaBlo is strictly more expressive than LoRA under the same parameter budget. DiaBlo achieves state-of-the-art results across commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, as well as 4-bit/2-bit quantization settings.
DRO-InstructZero: Distributionally Robust Prompt Optimization for Large Language Models: This work integrates distributionally robust optimization (DRO) into a Bayesian optimization framework for zero-shot instruction optimization, enabling optimized instructions to maintain reliable performance under distribution shift and adversarial evaluation conditions.
DRO-InstructZero: Distributionally Robust Prompt Optimization for Large Language Models: This work integrates distributionally robust optimization (DRO) into the Bayesian optimization (BO) framework of InstructZero. By maximizing the worst-case expected utility over an ambiguity set defined by an f-divergence ball, the automatically searched prompts maintain reliable performance under distribution shift.
Execution-Grounded Credit Assignment for GRPO in Code Generation: This paper proposes EGCA (Execution-Grounded Credit Assignment), which leverages execution traces to localize the earliest semantic deviation in a program and concentrates GRPO gradients on the causal token span, addressing the coarse-grained credit assignment problem in code generation. EGCA achieves 82.1% pass@1 on HumanEval.
Improving Code Localization with Repository Memory: By leveraging a repository's commit history to construct episodic memory (past commits) and semantic memory (summaries of active code functionality), this work enhances the code localization capability of language agents, achieving significant improvements on SWE-bench.
IMSE: Intrinsic Mixture of Spectral Experts Fine-tuning for Test-Time Adaptation: This paper proposes IMSE, which decomposes the linear layers of a pretrained ViT via SVD into "spectral experts" and adapts only the singular values for extremely parameter-efficient test-time adaptation. Combined with a diversity maximization loss and a domain-aware spectral code retrieval mechanism, IMSE achieves state-of-the-art performance across three settings: TTA, CTTA, and progressive CTTA.
Inference-Time Safety for Code LLMs via Retrieval-Augmented Revision: This paper proposes SOSecure, a training-free inference-time safety mechanism that retrieves relevant community security warnings from a Stack Overflow knowledge base via BM25, guiding the model to autonomously revise unsafe code during inference. SOSecure achieves up to 96.7% vulnerability fix rate with zero new vulnerability introductions across three real-world datasets.
InnoGym: Benchmarking the Innovation Potential of AI Agents: This paper proposes InnoGym, the first benchmark and framework for systematically evaluating the innovation potential of AI agents. It introduces two complementary metrics—Performance Gain and Novelty—and, through 18 improvable tasks, finds that current agents exhibit a degree of innovativeness but lack the robustness to reliably translate novel ideas into performance improvements.
KV Cache Transform Coding for Compact Storage in LLM Inference: This paper proposes KVTC, a KV cache compression method inspired by classical media compression techniques (PCA-based feature decorrelation + adaptive quantization + entropy coding). KVTC achieves up to 20× compression (40×+ in specific scenarios) on Llama 3, Mistral NeMo, and R1-Qwen 2.5, outperforming baselines including token eviction, quantization, and SVD-based methods.
Learning to Reason without External Rewards: This paper proposes Intuitor, an RLIF method that replaces external verifiable rewards with the model's own self-certainty (the KL divergence between the output distribution and a uniform distribution). Intuitor matches GRPO performance on mathematical reasoning while exhibiting superior generalization to out-of-domain tasks such as code generation.
MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task: Drawing inspiration from the Fill-in-the-Middle (FIM) paradigm in code completion, this work trains a dedicated step-expansion model, MathFimer-7B, to insert finer-grained intermediate reasoning steps into existing mathematical solution chains, thereby systematically improving the mathematical reasoning capability of downstream models.
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning: This work proposes PaperCoder — a multi-agent LLM framework that automatically converts machine learning papers into executable code repositories via a three-stage pipeline: Planning, Analysis, and Coding. 88% of the generated repositories are rated as best by the original paper authors, and the framework substantially outperforms baselines on the PaperBench benchmark.
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory: This paper proposes ReasoningBank, a memory framework that distills generalizable reasoning strategies from both successful and failed experiences as judged by the agent itself, and introduces memory-aware test-time scaling (MaTTS) to establish a synergy between memory and test-time scaling. The approach consistently outperforms baselines on WebArena, Mind2Web, and SWE-Bench (up to 34.2% relative improvement) while reducing interaction steps by 16%.
Sharing State Between Prompts and Programs: This paper proposes the shared program state abstraction, enabling prompts to directly read and write program variables, manipulate heap objects, and control program flow. The abstraction is realized in the Nightjar system (Python + prompt hybrid programming), achieving a 39.6% reduction in code size while maintaining or improving accuracy (+4–19%).
ShieldedCode: Learning Robust Representations for Virtual Machine Protected Code: This paper proposes ShieldedCode — the first protection-aware code representation learning framework. By introducing hierarchical dependency modeling (three levels: intra-instruction, preceding-instruction, and inter-instruction) and joint functional-aware and protection-aware contrastive learning, the framework enables LLMs to generate, compare, and reason about VM-protected code. ShieldedCode surpasses existing methods on both VM code generation (Pass@1 26.95% vs. GPT-4o 22.58%) and binary similarity detection.
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning: This paper proposes Supervised Reinforcement Learning (SRL), which reframes problem solving as a step-wise action generation process. By leveraging dense rewards based on sequence similarity, SRL enables small models to learn from expert trajectories on difficult reasoning problems that neither SFT nor RLVR can effectively handle.
The Limits of Long-Context Reasoning in Automated Bug Fixing: This paper systematically evaluates the limits of current LLMs in long-context code debugging. It finds that the success of agentic workflows stems from task decomposition rather than long-context reasoning (successful trajectories consume only 20–30K tokens), while performance degrades sharply under 64K single-pass patch generation (GPT-5-nano achieves 0%), revealing a significant gap between nominal context length and actual usable context capacity.
Training Large Language Models To Reason In Parallel With Global Forking Tokens: This paper proposes Set Supervised Fine-Tuning (SSFT), which aligns global forking tokens with diverse reasoning trajectories via bipartite matching, enabling LLMs to globally steer distinct reasoning modes from a single control token. SSFT substantially outperforms standard SFT and GRPO on mathematical reasoning and code generation tasks.
Training Large Language Models to Reason in Parallel with Global Forking Tokens: This paper proposes Set Supervised Fine-Tuning (SSFT), which introduces global forking tokens and a set-based loss via bipartite matching to train LLMs to produce diverse and correct reasoning patterns triggered by a single control token, outperforming standard SFT+GRPO on both Pass@1 and Cons@k.
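
The self-certainty signal mentioned in the Intuitor entry above can be sketched as follows. The entry does not state the direction of the KL divergence or how it is aggregated, so this sketch assumes KL(U ‖ p) from the uniform distribution U to the model's next-token distribution p, averaged over generated tokens; the function names are hypothetical.

```python
import math
from typing import List


def kl_uniform_to(p: List[float]) -> float:
    """KL(U || p) for a next-token distribution p with strictly positive entries."""
    v = len(p)
    return sum((1.0 / v) * math.log((1.0 / v) / pi) for pi in p)


def self_certainty(token_distributions: List[List[float]]) -> float:
    """Average KL(U || p) over the per-token distributions of one generated response."""
    return sum(kl_uniform_to(p) for p in token_distributions) / len(token_distributions)


# Toy distributions over a 4-token vocabulary, for illustration only.
uniform = [0.25, 0.25, 0.25, 0.25]     # maximally uncertain
mild = [0.4, 0.2, 0.2, 0.2]            # slightly confident
peaked = [0.97, 0.01, 0.01, 0.01]      # highly confident

score_uncertain = self_certainty([uniform, uniform])
score_confident = self_certainty([peaked, peaked])
```

Under this assumption the score is zero for a uniform distribution and grows as the distribution becomes more peaked, so it can stand in for an external reward without a verifier.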

🕸️ Graph Learning¶

A Geometric Perspective on the Difficulties of Learning GNN-based SAT Solvers: This paper proves, from the geometric perspective of graph Ricci curvature, that the bipartite graph representation of random k-SAT instances exhibits inherent negative curvature that decreases as problem difficulty increases. It establishes a theoretical connection between GNN oversquashing and SAT solving difficulty, and validates the theory through test-time graph rewiring.
Are We Measuring Oversmoothing in Graph Neural Networks Correctly?: This paper identifies that the widely adopted Dirichlet energy metric fails to correctly capture oversmoothing in GNNs under practical settings. It proposes the numerical/effective rank (Erank) of the feature representation matrix as an alternative measure. Empirically, Erank achieves an average correlation of 0.91 with accuracy (vs. 0.72 for Dirichlet energy), while on OGB-Arxiv, Dirichlet energy even exhibits an incorrect correlation direction. The paper further provides theoretical proofs that the numerical rank converges to 1 (rank collapse) for a broad family of GNN architectures, and redefines oversmoothing as rank collapse rather than feature vector alignment.
Beyond Simple Graphs: Neural Multi-Objective Routing on Multigraphs: This paper presents GMS, the first neural combinatorial optimization routing method for multigraphs, comprising two variants: GMS-EB, which performs edge-level autoregressive construction directly on the multigraph, and GMS-DH, a dual-head approach that learns to prune the multigraph before performing node-level routing. GMS achieves near-LKH performance on asymmetric multi-objective TSP and CVRP while being tens of times faster.
Cooperative Sheaf Neural Networks: This paper proposes in/out-degree sheaf Laplacians defined on directed graphs for cellular sheaves, and constructs a Cooperative Sheaf Neural Network (CSNN) that enables nodes to independently select information propagation/reception strategies, thereby simultaneously mitigating oversquashing and handling heterophilic tasks.
Embodied Agents Meet Personalization: Investigating Challenges and Solutions Through the Lens of Memory Utilization: This paper systematically evaluates the memory utilization capabilities of LLM-driven embodied agents through the Memento framework. It finds that existing agents can recall simple object semantics but fail to process sequential information in user behavior patterns. A hierarchical knowledge graph-based user profile memory module is proposed to effectively improve performance on personalized assistance tasks.
Entropy-Guided Dynamic Tokens for Graph-LLM Alignment in Molecular Understanding: This paper proposes EDT-Former (Entropy-guided Dynamic Token Transformer), which establishes efficient alignment between a frozen graph encoder and a frozen LLM via an entropy-guided dynamic token generation mechanism. Without fine-tuning the LLM backbone, EDT-Former achieves state-of-the-art performance across multiple benchmarks including molecular question answering, molecular instruction following, and property prediction.
Explore-on-Graph: Incentivizing Autonomous Exploration of LLMs on Knowledge Graphs: This paper proposes Explore-on-Graph (EoG), which leverages SFT followed by two-phase reinforcement learning (outcome reward + path-refined reward) to incentivize LLMs to autonomously explore reasoning paths on knowledge graphs beyond the training distribution, surpassing GPT-5 and Gemini 2.5 Pro on five KGQA benchmarks.
GRAPHITE: Graph Homophily Booster — Reimagining the Role of Discrete Features in Heterophilic Graph Learning: This paper proposes GRAPHITE, a learning-free graph transformation method that directly boosts graph homophily by introducing "feature nodes" as hubs to indirectly connect nodes sharing common features. It is the first approach to address heterophilic graph learning by modifying graph structure rather than redesigning GNN architectures, achieving substantial improvements over 27 state-of-the-art methods on challenging benchmarks such as Actor.
Graph Tokenization for Bridging Graphs and Transformers: This paper proposes GraphTokenizer, a framework that converts graphs into symbol sequences via invertible frequency-guided serialization, then applies BPE to learn a substructure vocabulary, enabling standard Transformers (e.g., BERT/GTE) to process graph data directly without any architectural modification, achieving state-of-the-art results on 14 benchmarks.
GraphUniverse: Synthetic Graph Generation for Evaluating Inductive Generalization: This paper proposes GraphUniverse, a framework that generates graph families with persistent semantic communities via a hierarchical architecture, enabling for the first time a systematic evaluation of inductive generalization in graph learning models. A key finding is that transductive performance cannot reliably predict inductive generalization ability.
Learning Concept Bottleneck Models from Mechanistic Explanations: This paper proposes Mechanistic CBM (M-CBM), which leverages Sparse Autoencoders to extract concepts from features learned by a black-box model, names and annotates them via a multimodal LLM, and constructs an interpretable Concept Bottleneck Model. Under controlled information leakage, M-CBM substantially outperforms existing CBM approaches.
LogicXGNN: Grounded Logical Rules for Explaining Graph Neural Networks: This paper proposes LogicXGNN, a post-hoc framework for extracting interpretable first-order logical rules from trained GNNs. The framework identifies predicates via graph structural hashing and hidden-layer embedding pattern recognition, determines discriminative DNF rule structures using decision trees, and grounds abstract predicates into the input space. The resulting rule-based classifier can serve as a substitute for the original GNN and also functions as a controllable graph generation model.
MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation: This paper introduces MolLangBench, a benchmark constructed via automated tools and expert annotation to provide high-quality, unambiguous evaluation datasets for the molecule-language interface. It covers three task types (recognition / editing / generation) and three modalities (SMILES / image / graph), evaluates 16+ commercial LLMs and 5 chemistry-specific models, and reveals that even GPT-5 falls significantly short on basic molecular operations (generation accuracy only 43%).
On the Expressive Power of GNNs for Boolean Satisfiability: This paper rigorously proves, from the perspective of the Weisfeiler-Leman (WL) test, that the complete WL hierarchy cannot distinguish satisfiable from unsatisfiable 3-SAT instances, revealing the theoretical expressiveness limits of GNNs for SAT solving. It also identifies positive instance families—such as planar SAT and random SAT—where GNNs can successfully distinguish satisfiability.
Pairwise is Not Enough: Hypergraph Neural Networks for Multi-Agent Pathfinding: This paper proposes HMAGAT, which replaces the pairwise message passing of GNNs with a directed hypergraph attention network to model group interactions in multi-agent pathfinding, surpassing an 85M-parameter SOTA model using only 1M parameters and 1% of the training data.
RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation: This paper proposes RAS, a framework that dynamically constructs query-specific knowledge graphs at inference time for each input question. Through three stages—iterative retrieval planning, text-to-triple conversion, and graph-augmented answering—RAS achieves structured reasoning and attains improvements of up to 7.0% and 8.7% over prior methods on 7 knowledge-intensive benchmarks for open-source and closed-source LLMs, respectively.
Relational Graph Transformer: This paper proposes RelGT, the first graph Transformer specifically designed for relational databases. Through multi-element tokenization (a 5-tuple of feature/type/hop distance/time/local structure encodings) and a local–global hybrid attention mechanism, RelGT consistently outperforms GNN baselines across all 21 tasks in the RelBench benchmark, with improvements of up to 18%.
Relatron: Automating Relational Machine Learning over Relational Databases: This work systematically compares relational deep learning (RDL/GNN) and deep feature synthesis (DFS) on predictive tasks over relational databases, finding that neither dominates uniformly and performance is highly task-dependent. The authors propose Relatron — a task-embedding-based meta-selector that leverages RDB task homophily and affinity embeddings for automatic architecture selection, achieving up to 18.5% improvement in joint architecture–hyperparameter search.
Revisiting Node Affinity Prediction in Temporal Graphs: This paper analyzes why simple heuristics (persistent forecasting, moving average) consistently outperform complex TGNNs on temporal graph node affinity prediction. It proves that these heuristics are special cases of linear SSMs and that standard RNNs/LSTMs/GRUs cannot express even the most basic persistent forecasting. Based on these findings, the paper proposes NAViS — a linear SSM architecture with a virtual global state and a ranking loss — which surpasses all baselines on TGB benchmarks.
Structurally Human, Semantically Biased: Detecting LLM-Generated References with Embeddings and GNNs: By constructing paired citation graphs (human vs. GPT-4o-generated vs. random baseline) for 10,000 papers, this work finds that LLM-generated reference lists are nearly indistinguishable from human ones in terms of graph topology (RF accuracy only 60%), yet are effectively detectable via semantic embeddings (RF 83%, GNN 93%). This indicates that LLMs accurately mimic citation topology while leaving detectable semantic fingerprints.
Towards Improved Sentence Representations using Token Graphs: This paper proposes Glot, a lightweight structure-aware pooling module that constructs a latent similarity graph from token-level hidden states of a frozen LLM, refines them via a GNN, and aggregates them into a sentence representation. On GLUE/MTEB, Glot performs competitively with fine-tuning-based methods while using 20× fewer parameters and training 100× faster.
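
The NAViS entry above states that persistent forecasting and moving averages are special cases of linear SSMs. A minimal sketch of that claim for scalar inputs is given below. The function names, the zero initial state, and the shift-register construction are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
SSM = Tuple[Matrix, List[float], List[float]]


def run_linear_ssm(A: Matrix, B: Sequence[float], C: Sequence[float],
                   xs: Sequence[float]) -> List[float]:
    """Run h_t = A h_{t-1} + B x_t and y_t = C . h_t, starting from h_0 = 0."""
    n = len(B)
    h = [0.0] * n
    ys = []
    for x in xs:
        h = [sum(A[i][j] * h[j] for j in range(n)) + B[i] * x for i in range(n)]
        ys.append(sum(C[i] * h[i] for i in range(n)))
    return ys


def persistence_ssm() -> SSM:
    """State size 1: A = 0 forgets the past, so the state is the latest input."""
    return [[0.0]], [1.0], [1.0]


def moving_average_ssm(k: int) -> SSM:
    """State size k: A shifts the last k inputs down by one slot, C averages them."""
    A = [[1.0 if j == i - 1 else 0.0 for j in range(k)] for i in range(k)]
    B = [1.0] + [0.0] * (k - 1)
    C = [1.0 / k] * k
    return A, B, C


if __name__ == "__main__":
    affinities = [3.0, 6.0, 9.0, 12.0]  # hypothetical affinity history
    print(run_linear_ssm(*persistence_ssm(), affinities))
    print(run_linear_ssm(*moving_average_ssm(3), affinities))
```

In both cases the output at step t serves as the forecast for step t+1. Early moving-average outputs are damped because the zero initial state pads the window.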

⚡ LLM Efficiency

Did You Check the Right Pocket? Cost-Sensitive Store Routing for Memory-Augmented Agents: This paper formalizes multi-store retrieval in memory-augmented agents as a cost-sensitive store routing problem, demonstrates that selective retrieval can reduce context tokens by 62% while improving QA accuracy (86% vs. 81%) over exhaustive retrieval, and proposes a semantics-based heuristic routing baseline.
DND: Boosting Large Language Models with Dynamic Nested Depth: DND selects critical tokens at the end of each Transformer layer via a router and routes them back through the same layer for additional processing (nested depth). Combined with a routing control loss and a threshold control scheme for precise and stable token selection, DND achieves average performance gains of 1.88% and 0.87% on Qwen3-1.7B and Qwen3-30B-A3B, respectively, with fewer than 0.1M additional parameters.
EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models: This paper proposes EvoEngineer, the first systematic LLM-based code evolution framework that decomposes code evolution into two orthogonal components — traverse technique (with a two-layer design: solution guiding + prompt engineering) and population management. On 91 real-world CUDA kernels, EvoEngineer achieves a median speedup of up to 2.72× and a code validity rate of 69.8%, outperforming existing methods on both performance and correctness.
Expert Divergence Learning for MoE-based Language Models: This paper addresses the expert homogenization problem in MoE training by maximizing the Jensen-Shannon divergence of routing distributions across different data domains, encouraging distinct expert subsets to be activated for different domains. The approach improves expert specialization and language modeling performance on a 15B-A1.5B model.
Fast Catch-Up, Late Switching: Optimal Batch Size Scheduling via Functional Scaling Laws: This paper derives the optimal batch size scheduling (BSS) strategy under a Functional Scaling Law (FSL) framework. For hard tasks, the optimal strategy is to train with small batches for most of the budget and switch to large batches only at the final stage (late switching). The paper further reveals a fast catch-up effect—after switching, the loss rapidly converges to the trajectory of full large-batch training—and validates these principles in LLM pretraining at 1.1B parameters and 1T tokens.
IterResearch: Rethinking Long-Horizon Agents with Interaction Scaling: IterResearch is an MDP-based iterative deep research paradigm that replaces mono-contextual linear accumulation with periodic workspace reconstruction. This lets agents scale to 2048 interactions within a 40K context length, with performance improving from 3.5% to 42.5%, and surpass open-source agents by an average of 14.5 percentage points across 6 benchmarks.
LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding: This paper proposes LycheeDecode, a fine-grained hybrid-head sparse decoding method that partitions attention heads into a small number of retrieval heads (which perform full attention to select critical tokens) and a large number of sparse heads (which reuse the selected tokens for sparse computation). Head roles are learned end-to-end via the Hard Kumaraswamy (HardKuma) distribution, achieving a 2.7× speedup at 128K context length while matching or surpassing full-attention baselines.
MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning: This paper proposes MVAR (Markovian Visual AutoRegressive), which introduces a scale Markov assumption (conditioning only on the adjacent preceding scale rather than all prior scales) and spatial Markov attention (restricting neighborhood size to $k$), reducing VAR's attention complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(Nk)$. MVAR achieves comparable or superior performance on ImageNet 256×256 while reducing inference memory by 3.0–4.2×, and requires only 8 RTX 4090 GPUs for training.
One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-based Continual Learning: This paper proposes SMoPE, a framework that organizes a single shared prompt into multiple prompt experts within a sparse MoE structure. Dynamic sparse activation is achieved via prompt-attention score aggregation, significantly alleviating knowledge interference while maintaining high parameter efficiency, achieving SOTA on multiple continual learning benchmarks.
RACE Attention: A Strictly Linear-Time Attention for Long-Sequence Training: This paper proposes RACE Attention — replacing softmax with a power angular kernel and approximating attention outputs via differentiable LSH sketches — achieving strictly linear time complexity, supporting up to 12M tokens on a single GPU and 75M tokens on a single CPU, while matching or surpassing softmax accuracy across diverse tasks.
Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective: This paper proposes the first unified mathematical model for KV cache-aware load balancing, introducing a randomized leaf-node eviction algorithm RLT (with $O(\log n)$ competitive ratio) and a learning-based greedy router LBGR, achieving up to 11.96× latency reduction and 14.06× TTFT reduction in multi-LLM serving scenarios.
Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling: This paper proposes the Semantic Parallelism (SP) paradigm, which predicts token-expert routing paths and co-schedules model placement with data dispatch to substantially reduce all-to-all communication overhead in MoE inference under expert parallelism. It achieves up to 2.78× throughput improvement in Attention-DP settings and up to 24.9% latency reduction in Attention-TP settings.
SwingArena: Adversarial Programming Arena for Long-context GitHub Issue Solving: This paper proposes SwingArena, an adversarial evaluation framework in which two LLMs alternately play the roles of patch submitter and test reviewer on real GitHub issues, with end-to-end verification through repository-native CI pipelines (compilation / lint / regression tests). Evaluated on 400 instances across C++, Python, Rust, and Go, the framework reveals behavioral divergence between models in terms of "aggressive patch generation" versus "defensive quality assurance."
TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Selection: This paper proposes TokenSeek, a general instance-aware token seeking and discarding method that evaluates token importance by combining contextual (attention) and gradient information, updates parameters only on selected tokens, and achieves up to 65.7% reduction in activation memory while maintaining or surpassing full-token fine-tuning performance.
Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models: This paper systematically dissects chunk-based sparse attention architectures, identifies three critical design principles (nonlinear Chunk Encoder + CLS token, Bypassing Residual Path, and enforced training-time sparsity), and successfully extrapolates a model trained on 4K context to 32 million tokens.
Universe Routing: Why Self-Evolving Agents Need Epistemic Control: This paper formalizes the tendency of autonomous agents to conflate incompatible epistemological frameworks (e.g., frequentist vs. Bayesian) during chain-of-thought reasoning as the "universe routing" problem. A lightweight 465M-parameter router is trained to classify queries into 7 mutually exclusive belief spaces and dispatch them to dedicated solvers. The work demonstrates that hard routing is 7× faster than soft MoE at equal accuracy, and that a modular architecture with rehearsal enables continual learning with zero forgetting.
When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework: This paper proposes a theoretical framework that decomposes long-context task failures into three types of noise (task noise / model noise / aggregator noise), proves that weak models with chunked processing can outperform strong models with full-context processing when model noise grows superlinearly, and provides a method to efficiently estimate the optimal chunk size using only 3–5 samples.
xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity: This paper systematically compares the scaling laws of xLSTM and Transformers, demonstrating that xLSTM strictly dominates Transformers of the same scale on the training loss–compute Pareto frontier, in the overtrained regime, and in inference speed, with the advantage growing as context length increases.
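
The Expert Divergence Learning entry above relies on the Jensen-Shannon divergence between routing distributions from different data domains. The sketch below only shows how that quantity can be computed for two domains. The helper names, the per-domain averaging step, and the toy routing probabilities are assumptions. The paper's actual loss (weighting, number of domains, where it is applied) is not described in the note.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def mean_routing(token_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token routing probabilities into one domain-level distribution."""
    n = len(token_probs)
    return [sum(column) / n for column in zip(*token_probs)]


if __name__ == "__main__":
    # Hypothetical router outputs over 3 experts for tokens from two domains.
    code_tokens = [[0.7, 0.2, 0.1], [0.8, 0.1, 0.1]]
    prose_tokens = [[0.1, 0.3, 0.6], [0.2, 0.2, 0.6]]
    divergence = js_divergence(mean_routing(code_tokens), mean_routing(prose_tokens))
    # A training objective that maximizes divergence could subtract this value from the loss.
    print(divergence)
```

The divergence is zero when both domains use the experts identically and reaches log(2) when they use disjoint expert sets, which is the direction the entry describes as encouraging specialization.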

🎬 Video Generation

Arbitrary Generative Video Interpolation: ArbInterp proposes a generative video frame interpolation framework supporting arbitrary timestamps and arbitrary sequence lengths. It achieves precise temporal control via Timestamp-aware Rotary Position Embedding (TaRoPE) and enables seamless long-sequence stitching through an appearance-motion decoupled conditioning strategy.
BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration: BindWeave replaces conventional shallow fusion mechanisms with a Multimodal Large Language Model (MLLM) to parse complex multi-subject textual instructions, generating subject-aware hidden states as conditioning signals for a DiT. Combined with CLIP semantic features and VAE fine-grained appearance features, it achieves high-fidelity, subject-consistent video generation.
Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models: This paper proposes Frame Guidance, a training-free frame-level guidance method that enables controllable video generation tasks — including keyframe guidance, stylization, and looping video — without modifying the model, via two core components: latent slicing (reducing memory by 60×) and Video Latent Optimization (VLO).
Geometry-aware 4D Video Generation for Robot Manipulation: This paper proposes a geometry-aware 4D video generation framework that trains a video diffusion model via cross-view pointmap alignment supervision, jointly predicting RGB and pointmap sequences to achieve spatiotemporally consistent multi-view RGB-D videos. Without requiring camera pose inputs at inference, the framework generates consistent videos from novel viewpoints and recovers robot end-effector trajectories using an off-the-shelf 6DoF pose tracker.
JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization: This paper proposes JavisDiT, a joint audio-video generation model built on the DiT architecture. It achieves fine-grained spatio-temporal audio-video alignment via a Hierarchical Spatio-Temporal Synchronization Prior Estimator (HiST-Sypo). The work also introduces a new benchmark, JavisBench (10K complex-scene samples), and a new evaluation metric, JavisScore.
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation: This paper proposes JavisDiT++, a clean and unified framework for joint audio-video generation (JAVG). It improves generation quality via modality-specific MoE, achieves frame-level synchronization through temporally aligned RoPE, and aligns outputs with human preferences via audio-video DPO. Built on Wan2.1-1.3B with only ~1M public data, it achieves state-of-the-art performance.
Language-guided Open-world Video Anomaly Detection under Weak Supervision: This paper proposes LaGoVAD, a language-guided open-world video anomaly detection paradigm that models anomaly definitions as random variables provided in natural language. Combined with dynamic video synthesis and contrastive learning regularization, it achieves zero-shot state-of-the-art performance across seven datasets.
Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control: This paper proposes RoboMaster, a framework that decomposes the robot–object interaction process into three temporal stages—pre-interaction, in-interaction, and post-interaction—via a collaborative trajectory representation, combined with appearance- and shape-aware object embeddings, to achieve high-quality video generation for robotic manipulation.
LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning: This paper proposes LoRA-Edit, which leverages spatiotemporal masks to guide LoRA fine-tuning of a pretrained I2V model, enabling controllable first-frame-guided video editing. The mask simultaneously serves as an instruction for the editing region and a guidance signal for LoRA learning, supporting motion inheritance and appearance control.
Lumos-1: On Autoregressive Video Generation with Discrete Diffusion from a Unified Model Perspective: This paper proposes Lumos-1, a unified video generation model built on a standard LLM architecture. It addresses visual spatiotemporal encoding via MM-RoPE (distributed multi-modal RoPE) and inter-frame loss imbalance via AR-DF (autoregressive discrete diffusion forcing). Trained with only 48 GPUs, Lumos-1 achieves competitive performance on GenEval, VBench-I2V, and VBench-T2V.
MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling: MoSA decomposes human video generation into a structure generation stage (a 3D Transformer generates physically plausible motion skeletons) and an appearance generation stage (a DiT synthesizes video conditioned on the skeletons). A Human-Aware Dynamic Control (HADC) module propagates sparse skeleton signals across the entire motion region. Combined with a dense tracking loss and contact constraints, MoSA comprehensively outperforms SOTA models such as HunyuanVideo and Wan 2.1 on FVD, CLIPSIM, and other metrics.
MotionStream: Real-Time Video Generation with Interactive Motion Controls: MotionStream is presented as the first real-time streaming video generation system with motion control. It first trains a bidirectional motion-control teacher with a lightweight track head on Wan DiT, then distills it into a causal student via Self Forcing + DMD. Attention sinks and a rolling KV cache are introduced to achieve full train-inference distribution matching, enabling infinite-length generation at constant speed: 17 FPS, or 29 FPS with Tiny VAE, at 480P on a single H100 GPU.
PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation: This paper proposes PreciseCache — a plug-and-play acceleration framework that precisely detects and skips genuinely redundant computations in video generation. It consists of LFCache (step-level, based on a Low-Frequency Difference (LFD) metric) and BlockCache (block-level, based on an input-output difference metric), achieving an average 2.6× speedup with negligible quality degradation on mainstream models such as Wan2.1-14B.
QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification: This paper proposes QuantSparse, the first framework to jointly integrate model quantization and attention sparsification for video diffusion Transformer compression. By introducing Multi-Scale Salient Attention Distillation (MSAD) and Second-Order Sparse Attention Reparameterization (SSAR), QuantSparse addresses the "amplified attention shift" problem caused by naive combination of the two techniques. On HunyuanVideo-13B with W4A8 and 15% attention density, it achieves 3.68× storage compression and 1.88× inference speedup with nearly lossless generation quality.
SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion: SIGMark proposes the first blind watermarking framework for modern video diffusion models, achieving scalable blind extraction with constant retrieval cost via Global Frame-level Pseudorandom Coding (GF-PRC), and addresses temporal perturbations under causal 3D VAE through a Segmented Group Ordering (SGO) module, attaining high bit accuracy and strong robustness on HunyuanVideo and Wan-2.2.
Streaming Autoregressive Video Generation via Diagonal Distillation: This paper proposes Diagonal Distillation (DiagDistill), which achieves 277.3× acceleration and 31 FPS real-time streaming autoregressive video generation via a diagonal denoising strategy (more steps for early chunks, fewer for later chunks) and a flow distribution matching loss.
Target-Aware Video Diffusion Models: This paper proposes a target-aware video diffusion model that generates videos of an actor interacting with a specified target object, given only a single input image and a segmentation mask of the target. The core innovations are the introduction of a special [TGT] token and a selective cross-attention loss that guides the model to attend to the spatial location of the target, achieving comprehensive improvements over baselines in both target alignment and video quality.
Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator: This paper proposes VIST3A, a framework that seamlessly connects the latent space of a pretrained video generator to a feed-forward 3D reconstruction model (e.g., AnySplat/MVDUSt3R/VGGT) via model stitching, and employs direct reward finetuning to align the generative model with the stitched 3D decoder. The approach enables high-quality end-to-end text-to-3DGS and text-to-pointmap generation, achieving state-of-the-art results on T3Bench, SceneBench, and DPG-Bench.
TTOM: Test-Time Optimization and Memorization for Compositional Video Generation: This paper proposes TTOM, a framework that aligns attention maps of video generation models with LLM-generated spatiotemporal layouts by optimizing newly introduced parameters at inference time, while a parameter memorization mechanism stores historical optimization contexts for reuse. TTOM achieves relative improvements of 34% (CogVideoX) and 14% (Wan2.1) on T2V-CompBench.
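
The ArbInterp entry above mentions a timestamp-aware rotary position embedding (TaRoPE) for interpolation at arbitrary timestamps. The note does not give its formulation, so the sketch below is ordinary RoPE evaluated at real-valued timestamps instead of integer positions. The function names and the `base` default are assumptions. It shows the property that makes fractional timestamps usable: the attention score depends only on the timestamp difference.

```python
import math
from typing import List, Sequence


def rotate_by_timestamp(vec: Sequence[float], t: float, base: float = 10000.0) -> List[float]:
    """Rotate each consecutive pair of features by an angle proportional to t."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out: List[float] = []
    for i in range(0, d, 2):
        theta = t * base ** (-i / d)  # standard RoPE frequency for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.8, 0.5]
    k = [1.0, 0.4, -0.7, 0.2]
    # Frames at t = 0.25 and t = 0.75 (fractions of the gap between two keyframes).
    score_a = dot(rotate_by_timestamp(q, 0.25), rotate_by_timestamp(k, 0.75))
    # Shifting both timestamps by the same amount leaves the score unchanged.
    score_b = dot(rotate_by_timestamp(q, 1.25), rotate_by_timestamp(k, 1.75))
    print(score_a, score_b)
```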

🚗 Autonomous Driving

Adaptive Augmentation-Aware Latent Learning for Robust LiDAR Semantic Segmentation: This paper proposes A3Point (Adaptive Augmentation-Aware Latent Learning), a training framework that addresses the augmentation dilemma in robust LiDAR segmentation via two core components: Semantic Confusion Prior (SCP) implicit learning and Semantic Shift Region (SSR) localization. By decoupling model-inherent semantic confusion from augmentation-induced semantic shift and adaptively optimizing across varying perturbation intensities, A3Point achieves state-of-the-art performance on multiple adverse-weather LiDAR segmentation generalization benchmarks.
SMART-R1: Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning: SMART-R1 is the first work to introduce R1-style reinforcement fine-tuning (RFT) into multi-agent traffic simulation. It proposes the Metric-oriented Policy Optimization (MPO) algorithm and an iterative "SFT-RFT-SFT" training strategy, achieving first place on the WOSAC 2025 leaderboard with a Realism Meta score of 0.7858.
Astra: General Interactive World Model with Autoregressive Denoising: This paper proposes Astra, a general interactive world model that enables action-conditioned long-horizon video prediction on top of a pretrained video diffusion model via an autoregressive denoising framework. Three key contributions are introduced: ACT-Adapter (action injection), noise-augmented history memory (to mitigate visual inertia), and Mixture of Action Experts (to unify heterogeneous action modalities). Astra achieves state-of-the-art fidelity and action-following capability across autonomous driving, robotic manipulation, and scene exploration scenarios.
BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving: BridgeDrive proposes replacing truncated diffusion with a diffusion bridge to achieve anchor-guided trajectory planning for autonomous driving, ensuring theoretical symmetry between the forward and reverse processes. On the Bench2Drive closed-loop benchmark, it achieves success rates of 74.99% (PDM-Lite) and 89.25% (LEAD), surpassing the previous SOTA by 7.72% and 2.45%, respectively.
DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving: DrivingGen introduces the first comprehensive benchmark for autonomous driving video world models, comprising a diverse evaluation dataset spanning weather/geography/time/complex scenarios and a four-dimensional metric framework (distribution, quality, temporal consistency, trajectory alignment). Evaluation of 14 SOTA models reveals a fundamental trade-off between general-purpose and driving-specific models.
EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video: Apple collected 829 hours of egocentric video paired with 3D hand joint tracking data (EgoDex) using Vision Pro, covering 194 tabletop manipulation tasks and 338K trajectories. The dataset is used to systematically benchmark imitation learning policies (BC/DDPM/FM + Transformer), providing the largest-scale data foundation to date for scaling dexterous manipulation training.
MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding: MARC is a framework that adopts a "retrieve-then-compress" strategy: a Visual Memory Retriever (VMR) selects the most query-relevant video segments, and Compression GRPO (C-GRPO) distills the reasoning capability of a 64-frame teacher model into a student model that operates on only 1-frame tokens. This achieves 95% visual token compression, a 72% GPU memory reduction, and a 23.9% inference latency reduction with virtually no performance loss (42.20 vs. 42.21).
Multi-Head Low-Rank Attention (MLRA): This paper proposes Multi-Head Low-Rank Attention (MLRA), which decomposes the single latent head in MLA into multiple independently shardable latent heads and sums the attention outputs across branches, enabling native 4-way tensor parallelism. The method achieves 2.8× decoding speedup while maintaining state-of-the-art performance.
NeMo-map: Neural Implicit Flow Fields for Spatio-Temporal Motion Mapping: NeMo-map is a continuous spatio-temporal dynamic map based on neural implicit functions that directly maps spatio-temporal coordinates to Semi-Wrapped Gaussian Mixture Model (SWGMM) parameters. It eliminates the spatial discretization and temporal segmentation constraints of conventional methods, achieving lower NLL and smoother velocity distributions on real pedestrian tracking data.
ResWorld: Temporal Residual World Model for End-to-End Autonomous Driving: ResWorld proposes a Temporal Residual World Model (TR-World) that extracts dynamic object information by computing temporal residuals of BEV scene representations—without relying on detection or tracking—thereby avoiding redundant modeling of static regions. Combined with a Future-Guided Trajectory Refinement (FGTR) module that leverages predicted future BEV features to refine planned trajectories, ResWorld achieves state-of-the-art planning performance on nuScenes and NAVSIM.
SEAL: Segment Any Events with Language: This paper introduces the open-vocabulary event instance segmentation (OV-EIS) task for the first time, and proposes the SEAL framework. Through multimodal hierarchical semantic guidance (MHSG) and a lightweight multimodal fusion network, SEAL achieves multi-granularity (instance-level + part-level) semantic segmentation of event streams using only event–image pairs (without dense annotations), substantially outperforming all baselines while achieving the fastest inference speed.
SiMO: Single-Modality-Operable Multimodal Collaborative Perception: This paper proposes SiMO, a framework that introduces the LAMMA fusion module and PAFR training strategy to achieve, for the first time in multi-agent collaborative perception, a multimodal perception system that remains operational under arbitrary modality absence—particularly when LiDAR fails and only cameras are available. The design is analogous to a parallel circuit: the system functions as long as at least one pathway is active.
Single Pixel Image Classification using an Ultrafast Digital Light Projector: This paper presents an experimental single-pixel imaging (SPI) system based on a microLED-on-CMOS ultrafast digital light projector, combined with low-complexity machine learning models (ELM and DNN) to achieve sub-millisecond image encoding and kHz-rate image classification. The system attains >90% accuracy on the MNIST dataset and >99% AUC in binary classification scenarios.
SPACeR: Self-Play Anchoring with Centralized Reference Models: SPACeR proposes a "human-like self-play" framework that uses a pretrained tokenized autoregressive motion model as a centralized reference policy. By incorporating log-likelihood rewards and KL divergence constraints, it guides a decentralized self-play RL policy to align with the human driving distribution. SPACeR outperforms pure self-play methods on WOSAC while achieving 10× faster inference and 50× fewer parameters than imitation learning approaches.
Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis: SG-NLF proposes a pose-free LiDAR NeRF framework that integrates spectral information with geometric consistency. It leverages a hybrid spectral-geometric representation for continuous smooth geometry reconstruction, a confidence-aware pose graph for global pose optimization, and an adversarial learning strategy to enforce cross-frame consistency, achieving improvements of 35.8% in reconstruction quality and 68.8% in pose accuracy over the previous state of the art.
ST4VLA: Spatially Guided Training for Vision-Language-Action Models: This paper proposes ST4VLA, a two-stage spatially guided training framework (spatial grounding pre-training + spatially guided action post-training) that explicitly injects VLM spatial priors into VLA policy learning. On SimplerEnv, it improves the Google Robot success rate from 66.1% to 84.6% and WidowX from 54.7% to 73.2%, achieving state-of-the-art performance.
Steerable Adversarial Scenario Generation through Test-Time Preference Alignment (SAGE): SAGE reformulates adversarial scenario generation for autonomous driving as a multi-objective preference alignment problem. By training two preference expert models and performing weight interpolation at inference time, it enables a continuous and steerable trade-off between adversariality and realism—without retraining—generating a full spectrum of scenarios from mild to aggressive, substantially improving closed-loop training performance.
x²-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space: x²-Fusion introduces Event Edge Space — the first edge-based isomorphic latent space — that unifies image, LiDAR, and event camera features into a shared edge-centric representation. Combined with reliability-aware adaptive fusion and cross-dimension contrastive learning, it achieves state-of-the-art joint 2D optical flow and 3D scene flow estimation under both standard and degraded conditions.
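
The SAGE entry above describes steering between adversariality and realism by interpolating the weights of two preference expert models at inference time. A minimal sketch of such an interpolation is shown below, assuming a plain linear blend over parameters stored as name-to-list dictionaries. The parameter names, the toy values, and the choice of which expert sits at `lam = 0` are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def interpolate_weights(expert_a: Params, expert_b: Params, lam: float) -> Params:
    """Return (1 - lam) * expert_a + lam * expert_b, parameter by parameter."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if expert_a.keys() != expert_b.keys():
        raise ValueError("experts must share the same parameter names")
    blended: Params = {}
    for name, a_values in expert_a.items():
        b_values = expert_b[name]
        if len(a_values) != len(b_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        blended[name] = [(1.0 - lam) * a + lam * b for a, b in zip(a_values, b_values)]
    return blended


if __name__ == "__main__":
    # Toy stand-ins for the two preference experts.
    realism_expert = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
    adversarial_expert = {"layer.weight": [4.0, 0.0], "layer.bias": [3.0]}
    for lam in (0.0, 0.5, 1.0):
        print(lam, interpolate_weights(realism_expert, adversarial_expert, lam))
```

Sweeping `lam` from 0 to 1 yields a family of models without retraining, which matches the continuous trade-off the entry describes.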

🔗 Causal Inference

Action-Guided Attention for Video Action Anticipation: This paper proposes an Action-Guided Attention (AGA) mechanism that uses the model's own action prediction sequences as the Query and Key in attention (rather than pixel-level features), combined with adaptive gated fusion of historical context and current frame features. AGA achieves strong generalization from validation to test set on EPIC-Kitchens-100 and supports post-hoc interpretability analysis.
AgentTrace: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems: This paper proposes AgentTrace, a framework that constructs causal graphs from execution logs of multi-agent systems and localizes root cause nodes via backward tracing combined with lightweight feature-based ranking (a weighted linear combination of five feature groups). On 550 synthetic fault scenarios, AgentTrace achieves Hit@1 of 94.9% with a latency of 0.12 seconds—69× faster than LLM-based analysis.
Copy-Paste to Mitigate Large Language Model Hallucinations: This paper proposes a Copy-Paste generation paradigm that trains LLMs to preferentially copy spans directly from retrieved context rather than paraphrasing them freely. Combined with high-copy-preference DPO training, the approach improves faithfulness on counterfactual RAG benchmarks from 80.2% to 92.8%.
Counterfactual Explanations on Robust Perceptual Geodesics: This paper proposes PCG (Perceptual Counterfactual Geodesic), a method that generates semantically faithful counterfactual explanations by optimizing geodesics on a robust perceptual manifold. A two-stage optimization ensures that the resulting path is both perceptually natural and reaches the target class. PCG achieves FID=8.3 on AFHQ, substantially outperforming RSGD (FID=12.9).
Direct Doubly Robust Estimation of Conditional Quantile Contrasts: This paper proposes the first direct estimation method for the conditional quantile contrast (CQC) by explicitly parameterizing the CQC and combining it with doubly robust gradient descent. The approach maintains theoretical double robustness while empirically outperforming existing indirect inversion methods across estimation accuracy, interpretability, and computational efficiency.
Distributional Equivalence in Linear Non-Gaussian Latent-Variable Cyclic Causal Models: This work provides, for the first time in the linear non-Gaussian setting and without any structural assumptions, a complete graphical criterion for distributional equivalence among causal graphs with latent variables and cycles. The central technical tool is the newly proposed edge rank constraints, upon which algorithms for enumerating equivalence classes and recovering causal models from data are developed — representing the first equivalence characterization and discovery method in parametric causal models that requires no structural assumptions.
Efficient Ensemble Conditional Independence Test Framework for Causal Discovery: This paper proposes E-CIT (Ensemble Conditional Independence Test), a framework that partitions data into subsets, performs independent tests on each subset, and aggregates the resulting p-values via a stable distribution-based combination method. E-CIT reduces the computational complexity of any base CIT to linear in sample size, while maintaining or improving test power in challenging settings such as heavy-tailed noise and real-world data.
Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models: This paper systematically investigates the over-reliance of preference models on five surface-level features (verbosity, structure, jargon, sycophancy, and vagueness). By constructing causal counterfactual pairs, it quantifies how biases originate from distributional imbalances in training data, and proposes a post-training method based on Counterfactual Data Augmentation (CDA) that reduces the average miscalibration rate relative to human judgments from 39.4% to 32.5%.
Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition: Using off-by-one addition (e.g., 1+1=3, 2+2=5) as a counterfactual task, this paper applies path patching to reveal a function induction mechanism within large language models — an attention head circuit that performs inductive reasoning at the function level, transcending token-level pattern matching — and demonstrates that this mechanism is reused across tasks.
Journey to the Centre of Cluster: Harnessing Interior Nodes for A/B Testing under Network Interference: To address the high-variance problem in GATE estimation for A/B testing under network interference, this paper proposes the Mean-in-Interior (MII) estimator—which averages only over interior nodes within each cluster to substantially reduce variance—and further introduces a counterfactual predictor to correct for covariate shift, yielding the augmented AMII estimator that achieves low bias and low variance simultaneously.
Learning Robust Intervention Representations with Delta Embeddings: This paper proposes the Causal Delta Embedding (CDE) framework, which represents interventions/actions as vector differences between pre- and post-intervention states in the latent space. Three constraints—independence, sparsity, and invariance—are imposed on the delta vectors to learn robust intervention representations. The framework significantly surpasses baselines on the Causal Triplet benchmark in OOD generalization, and autonomously discovers anti-parallel semantic structures for antonymous actions.
On the Eligibility of LLMs for Counterfactual Reasoning: A Decompositional Study: This paper proposes a decompositional evaluation framework grounded in Structural Causal Models (SCMs), decomposing LLM counterfactual reasoning into four stages (causal variable identification → causal graph construction → intervention identification → outcome reasoning). It systematically diagnoses capability bottlenecks at each stage across 11 multimodal datasets, and introduces tool-augmented and advanced elicitation strategies to improve performance.
Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement: This paper proposes Knowledgeable-R1, a reinforcement learning-based framework that jointly samples trajectories from parametric knowledge (PK) and contextual knowledge (CK), combined with local/global advantage estimation and adaptive asymmetric advantage transformation, enabling LLMs to resist misleading retrieved contexts in RAG scenarios while preserving the ability to leverage reliable context.
RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Perturbations: This paper proposes a formal framework for reasoning faithfulness (stance consistency + causal influence) and the RFEval benchmark (7,186 instances × 7 tasks). By applying output-level counterfactual interventions to evaluate 12 open-source LRMs, it finds that 49.7% of outputs are unfaithful and that accuracy is not a reliable proxy for faithfulness.
Self-Supervised Learning from Structural Invariance: This paper proposes AdaSSL, which introduces latent variables to model conditional uncertainty between positive pairs, derives a variational lower bound on mutual information, and enables SSL to handle complex (multimodal, heteroscedastic) conditional distributions in naturally paired data. AdaSSL outperforms baselines on causal representation learning, fine-grained image understanding, and video world models.
SelfReflect: Can LLMs Communicate Their Internal Answer Distribution?: This paper proposes SelfReflect — an information-theoretic distance metric that measures the discrepancy between an LLM's self-reported uncertainty summary and its true internal answer distribution. The study finds that modern LLMs are broadly incapable of autonomously reflecting their internal uncertainty, but that faithful uncertainty summaries can be generated by sampling multiple outputs and feeding them back into the context.
Synthesising Counterfactual Explanations via Label-Conditional Gaussian Mixture Variational Autoencoders: This paper proposes L-GMVAE (Label-Conditional Gaussian Mixture VAE) and the LAPACE algorithm. By learning multiple Gaussian cluster centers per class in the latent space and performing linear interpolation from the input's latent representation to the target class center, the method generates path-based counterfactual explanations while guaranteeing validity, plausibility, diversity, and perfect robustness to input perturbations.
Validating Interpretability in siRNA Efficacy Prediction: A Perturbation-Based, Dataset-Aware Protocol: This paper proposes a standardized perturbation-based saliency faithfulness validation protocol for siRNA efficacy prediction, serving as a "pre-synthesis checkpoint" to assess the reliability of saliency maps. The authors further introduce BioPrior, a biologically informed regularization method to improve saliency faithfulness. Results show that 19/20 fold-dataset instances pass the validation, while cross-dataset transfer reveals two distinct failure modes.
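
The partition, test, and aggregate pattern in the E-CIT entry above can be sketched as follows. The note does not give the paper's exact combination rule, so this sketch swaps in the equal-weight Cauchy combination rule (the Cauchy distribution is one member of the stable family) as an assumption. `base_test` is a hypothetical callable that returns a p-value for one subset.

```python
import math
from typing import Callable, List, Sequence


def split_into_subsets(data: Sequence, num_subsets: int) -> List[Sequence]:
    """Partition data into num_subsets contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_subsets)
    chunks, start = [], 0
    for i in range(num_subsets):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def cauchy_combine(p_values: Sequence[float]) -> float:
    """Combine p-values with the equal-weight Cauchy combination rule."""
    # Clip so tan() stays finite when a p-value is exactly 0 or 1.
    clipped = [min(max(p, 1e-15), 1.0 - 1e-15) for p in p_values]
    statistic = sum(math.tan((0.5 - p) * math.pi) for p in clipped) / len(clipped)
    return 0.5 - math.atan(statistic) / math.pi


def ensemble_test(
    data: Sequence, base_test: Callable[[Sequence], float], num_subsets: int
) -> float:
    """Run base_test on each subset and aggregate the resulting p-values."""
    subsets = split_into_subsets(data, num_subsets)
    return cauchy_combine([base_test(subset) for subset in subsets])
```

With a fixed subset size, the number of base-test calls grows linearly with the sample size, which is the source of the linear complexity the entry describes.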

🖼️ Image Restoration

Activation Steering for Masked Diffusion Language Models: This work is the first to apply activation steering to Masked Diffusion Language Models (MDLMs), demonstrating that refusal behavior in MDLMs is likewise governed by a single low-dimensional direction. Globally projecting out this direction at every denoising step completely bypasses safety alignment. Unlike autoregressive models, effective directions can be extracted from pre-instruction tokens—reflecting the non-causal, parallel processing nature of diffusion models.
AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size: Through statistical analysis of token confidence dynamics during the denoising process of diffusion language models (dLLMs), this work identifies a "Volatility Band" (VB) region that encodes local semantic structure in text. Building on this observation, it proposes AdaBlock-dLLM—a training-free, plug-and-play adaptive block size scheduler that aligns block boundaries in semi-autoregressive decoding with natural semantic steps, achieving up to 5.3% accuracy improvement at the same throughput.
Are Deep Speech Denoising Models Robust to Adversarial Noise?: This paper presents the first systematic evaluation of the robustness of four SOTA deep speech denoising (DNS) models against adversarial noise. By generating perceptually imperceptible adversarial perturbations via PGD attacks constrained by psychoacoustic masking, the authors demonstrate that Demucs, Full-SubNet+, FRCRN, and MP-SENet can be made to produce completely unintelligible gibberish. The evaluation covers diverse acoustic conditions and human listening studies, while also revealing the limitations of targeted attacks, universal perturbations, and cross-model transferability.
Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes: The LSP scheduler atomically commits the longest contiguous stable prefix at each denoising step—rather than accepting scattered discrete tokens—achieving up to 3.4× speedup in DLM inference while maintaining or slightly improving output quality.
Breaking Scale Anchoring: Frequency Representation Learning for Accurate High-Resolution Inference from Low-Resolution Training: This paper defines the novel problem of "Scale Anchoring" (SA)—wherein training on low-resolution data causes inference errors to remain anchored at training-resolution levels during high-resolution inference—and proposes an architecture-agnostic Frequency Representation Learning (FRL) method. By introducing Nyquist-normalized frequency encodings, FRL enables errors to decrease as resolution increases, with effectiveness validated across 8 mainstream architectures.
DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation: This paper proposes DiffusionBlocks, which interprets the layer-wise updates of residual networks as discretization steps of a continuous-time diffusion process, enabling the network to be partitioned into fully independently trainable blocks. This approach achieves competitive performance with end-to-end training while reducing training memory by a factor of $B$ (the number of blocks).
Generalizing Linear Autoencoder Recommenders with Decoupled Expected Quadratic Loss: This paper generalizes the EDLAE recommendation model's objective to a Decoupled Expected Quadratic Loss (DEQL), derives closed-form solutions over a broader hyperparameter range ($b>0$), and reduces computational complexity from $O(n^4)$ to $O(n^3)$ via the Miller matrix inversion lemma, surpassing EDLAE and deep learning models on multiple benchmark datasets.
Horizon Imagination: Efficient On-Policy Rollout in Diffusion World Models: This paper proposes Horizon Imagination (HI), which samples actions at an intermediate denoising step and processes multiple future frames in parallel, reducing the per-frame computation of on-policy imagination in diffusion world models to less than one full denoising pass while maintaining control performance.
InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions: InterActHuman enables audio-driven video generation for multi-person and human interaction scenarios through an automatic spatiotemporal layout mask predictor and an iterative mask guidance strategy, supporting independent speech-driven lip synchronization and body motion for each character.
Mechanism of Task-oriented Information Removal in In-context Learning: This paper proposes a novel "information removal" perspective to explain the internal mechanism of In-context Learning (ICL): it finds that under zero-shot settings, language models encode queries into "non-selective representations" containing information about all possible tasks (leading to near-random outputs), while the core function of few-shot ICL is to simulate a "task-oriented information removal" process—through identified "Denoising Heads" that selectively remove redundant task information from entangled representations, guiding the model to focus on the target task. Ablation experiments confirm that blocking Denoising Heads significantly degrades ICL accuracy.
ProtoTS: Learning Hierarchical Prototypes for Explainable Time Series Forecasting: ProtoTS achieves explainable time series forecasting via hierarchical prototype learning: a small number of coarse-grained prototypes provide a global pattern overview, while progressive refinement captures local variations. Heterogeneous exogenous variables are handled through multi-channel embedding and bottleneck fusion. On the LOF dataset, MSE is reduced by 48.3% and MAE by 20.9%. The framework additionally supports expert editing of prototypes to further improve performance.
Sharpness-Aware Machine Unlearning: This paper systematically analyzes the theoretical properties of SAM in the machine unlearning setting through a signal-noise decomposition framework. It finds that SAM abandons its denoising capability on the forget set while retaining it on the retain set. Motivated by this finding, the paper proposes the Sharp MinMax algorithm, which partitions the model into two components subject to sharpness minimization (retain) and sharpness maximization (forget) respectively, achieving state-of-the-art unlearning performance.
Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs: This work presents the first systematic comparison of layer-wise representation structure between diffusion large language models (dLLMs) and autoregressive (AR) LLMs. It finds that natively trained dLLMs exhibit stronger hierarchical abstraction and greater early-layer redundancy. Based on this finding, a static, task-agnostic inference-time layer skipping strategy is proposed, achieving 90%+ performance retention on LLaDA while skipping 6 layers (18.75% FLOPs reduction).
Trust but Verify: Adaptive Conditioning for Reference-Based Diffusion Super-Resolution: This paper proposes Ada-RefSR, a single-step reference-guided diffusion super-resolution framework based on the "Trust but Verify" principle. It introduces an Adaptive Implicit Correlation Gating (AICG) mechanism that maximally exploits reliable reference information while suppressing erroneous fusion, incurring only 0.13% additional computational overhead.
wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models: This paper proposes wd1, a ratio-free weighted log-likelihood policy optimization method for RL fine-tuning of diffusion language models (dLLMs). By combining positive-sample weighting with negative-sample penalization, wd1 avoids the bias and high variance introduced by policy ratio estimation in GRPO. Built on LLaDA-8B, it reports state-of-the-art results: +59% on Sudoku and 84.5% on GSM8K.
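
The "projecting out a direction" operation in the activation steering entry above can be sketched for a single activation vector as follows. This is a minimal illustration under stated assumptions: how the refusal direction is extracted is not shown, the function name `project_out` is hypothetical, and the vectors are toy values.

```python
import math
from typing import List


def project_out(hidden: List[float], direction: List[float]) -> List[float]:
    """Remove the component of hidden that lies along direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    coefficient = sum(h * u for h, u in zip(hidden, unit))
    return [h - coefficient * u for h, u in zip(hidden, unit)]


# Hypothetical 3-d activation and steering direction.
steered = project_out([3.0, 4.0, 5.0], [0.0, 0.0, 2.0])
```

Applying the same projection to the activations at each denoising step would correspond to the "global" variant the entry describes.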

🔄 Self-Supervised Learning

Difficult Examples Hurt Unsupervised Contrastive Learning: A Theoretical Perspective: This paper provides a rigorous theoretical proof, via a similarity graph model, that difficult examples (cross-class sample pairs with high similarity) hurt unsupervised contrastive learning: they strictly worsen the generalization error bound. Three theoretically grounded mitigation strategies are proposed: removing difficult examples, adjusting margins, and temperature scaling. On TinyImageNet, the approach improves linear probing accuracy by up to 10.42%. The finding is counterintuitive: although "more data is better" is a common principle in deep learning, carefully removing difficult examples benefits contrastive learning.
Enhancing Molecular Property Predictions by Learning from Bond Modelling and Interactions: DeMol is a dual-graph enhanced multi-scale interaction framework that introduces parallel atom-centric and bond-centric channels along with Double-Helix Blocks to explicitly model atom–atom, atom–bond, and bond–bond interactions, achieving state-of-the-art performance on PCQM4Mv2, OC20, QM9, and related benchmarks.
Fly-CL: A Fly-Inspired Framework for Enhancing Efficient Decorrelation and Reduced Training Time in Pre-trained Model-based Continual Representation Learning: Inspired by the Drosophila olfactory circuit, Fly-CL achieves progressive decorrelation through three stages (sparse random projection, top-$k$ activation, and streaming ridge classification), significantly reducing training time while attaining state-of-the-art performance in pre-trained model-based continual learning.
Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models: This paper proposes GradFix, a method that constructs a binary mask from gradient signs computed on a minimal number of samples from the target pre-trained model, and uses it to filter the source model's task vector coordinate-wise, retaining only components aligned with the descent direction of the target loss landscape. Without any fine-tuning, GradFix enables task knowledge transfer across pre-trained models, provides a rigorous first-order descent guarantee in theory, and substantially outperforms both naive transfer and few-shot fine-tuning on vision and language benchmarks.
InfoNCE Induces Gaussian Distribution: This paper theoretically proves that the InfoNCE loss induces representations toward a Gaussian distribution via two complementary mechanisms: an empirical idealization route (alignment + spherical uniformity → Gaussian) and a regularization route (vanishing regularizer → isotropic Gaussian). The findings are validated on synthetic data and CIFAR-10.
Maximizing Asynchronicity in Event-based Neural Networks: This paper proposes EVA, a framework that treats events as language tokens and employs an RWKV-6-based linear attention asynchronous encoder to update features event-by-event. Combined with a self-supervised learning scheme consisting of Multi-Representation Prediction (MRP) and Next-Representation Prediction (NRP), EVA learns generalizable features and, for the first time, successfully tackles the challenging object detection task within the Asynchronous-to-Synchronous (A2S) paradigm (0.477 mAP on the Gen1 dataset).
Maximizing Incremental Information Entropy for Contrastive Learning: This paper proposes IE-CL (Incremental-Entropy Contrastive Learning), a framework that explicitly maximizes entropy gain between augmented views—rather than merely maximizing mutual information—by treating the encoder as an information bottleneck and jointly optimizing a learnable transformation module (for entropy generation) and an encoder regularizer (for entropy preservation). IE-CL consistently improves contrastive learning performance on CIFAR-10/100, STL-10, and ImageNet under small-batch settings, with its core modules serving as plug-and-play components compatible with existing frameworks.
No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves: This paper proposes Self-Representation Alignment (SRA), which identifies that internal representations of diffusion Transformers exhibit a quality gradient along two dimensions—increasing layer depth and decreasing noise level. Based on this observation, SRA aligns early-layer, high-noise representations of a student network to late-layer, low-noise representations of an EMA teacher, requiring no external representation components (DINOv2/CLIP/MAE), and substantially accelerates convergence while improving generation quality on DiT and SiT (SiT-XL/2 achieves FID 1.58 at 800 epochs, comparable to REPA which relies on DINOv2).
PonderLM: Pretraining Language Models to Ponder in Continuous Space: This paper proposes PonderLM, which introduces a "pondering" mechanism at pretraining time — computing a weighted sum of the predicted probability distribution over token embeddings to form a continuous pondering embedding, then performing repeated forward passes. Without labeled data or reinforcement learning, a 2.8B model trained with this approach surpasses a 6.9B baseline on 9 downstream tasks.
Regularized Latent Dynamics Prediction is a Strong Baseline for Behavioral Foundation Models: This paper proposes Regularized Latent Dynamics Prediction (RLDP), which augments a self-supervised latent next-state prediction objective with a simple orthogonality regularization to preserve feature diversity. RLDP matches or surpasses complex state-of-the-art representation learning methods in zero-shot RL, with particularly notable advantages in low-coverage settings.
SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty: SNAP-UQ proposes a single-forward-pass uncertainty estimation method tailored for TinyML scenarios. Lightweight int8 prediction heads are attached at selected tap layers of a backbone network; these heads predict the activation statistics of the next layer in a self-supervised manner. The deviation ("surprisal") between predicted and actual activations is aggregated into an uncertainty score. The method requires no additional forward passes, temporal buffers, or ensembles, and adds only tens of kilobytes of flash memory, enabling reliable distribution-shift detection and failure detection on microcontrollers.
Soft Equivariance Regularization for Invariant Self-Supervised Learning: This paper proposes SER (Soft Equivariance Regularization), a layer-decoupled design that applies soft equivariance regularization to intermediate ViT layers while preserving the invariance objective at the final layer. Without introducing additional modules, SER consistently improves classification accuracy and robustness for invariant SSL methods (MoCo-v3, DINO, Barlow Twins).
Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability: This paper reveals that post-training methods such as RLHF and DPO systematically impair models' in-context steerability, output coverage, and distributional alignment. It proposes the Spectrum Suite evaluation framework and the Spectrum Tuning method, representing the first post-training approach that improves distributional alignment.
Temporal Slowness in Central Vision Drives Semantic Object Learning: This work trains SSL models on Ego4D data while simulating human central vision (foveal cropping) and the temporal slowness principle (temporal contrastive learning). It shows that combining the two mechanisms improves semantic object representations: central vision enhances foreground extraction, while temporal slowness distills semantic information during fixation periods.
Why Prototypes Collapse: Diagnosing and Preventing Partial Collapse in Prototypical Self-Supervised Learning: This paper diagnoses that the root cause of partial prototype collapse in prototypical self-supervised learning is shortcut learning induced by joint optimization of the encoder and prototypes. It proposes a fully decoupled training strategy—estimating prototypes independently via an online GMM—to completely eliminate collapse and improve downstream performance.
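
Several entries in this section build on the InfoNCE contrastive objective. As a reference point, the sketch below computes the standard InfoNCE loss for a single anchor from precomputed similarity scores. The function name `info_nce` and the default temperature are assumptions made for illustration, not values from any of the papers.

```python
import math
from typing import Sequence


def info_nce(
    positive_sim: float, negative_sims: Sequence[float], temperature: float = 0.1
) -> float:
    """InfoNCE loss for one anchor: -log softmax of the positive among all candidates."""
    logits = [positive_sim / temperature] + [s / temperature for s in negative_sims]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]
```

When the positive and all negatives have the same similarity, the loss equals the log of the number of candidates. It shrinks as the positive pulls ahead of the negatives.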

✂️ Segmentation

AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation: This paper proposes an Alignment-aware Masked Learning (AML) strategy that quantifies vision-language patch-level alignment and filters low-alignment pixels, enabling RIS models to focus on reliable regions during training. Without any architectural modifications, AML achieves state-of-the-art performance across all 8 splits of RefCOCO benchmarks.
ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer: This paper proposes ByteFlow Net, a tokenizer-free hierarchical byte-level language model that leverages information-theoretic coding rate to adaptively compress raw byte streams into semantic units, outperforming BPE baselines and existing byte-level architectures on both pretraining loss and downstream tasks.
Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval: This paper identifies sparse perception patterns in SAM2 analogous to biological vision — the decoder focuses on foreground while the encoder computes broadly, and only a small subset of tokens in memory frames are active with temporally consistent saliency. Based on these observations, Efficient-SAM2 is proposed, which eliminates redundant computation via object-aware Sparse Window Routing (SWR) and Sparse Memory Retrieval (SMR), achieving 1.68× end-to-end speedup on SAM2.1-L with only 1% accuracy loss.
Locality-Attending Vision Transformer: This paper proposes LocAt, a modular plug-in comprising GAug and PRR, which biases attention toward local neighborhoods via learnable Gaussian kernels and refines patch representations. Without modifying the training objective, it improves ViT segmentation performance on ADE20K by over 6% while simultaneously boosting classification accuracy.
RegionReasoner: Region-Grounded Multi-Round Visual Reasoning: This paper proposes RegionReasoner, a reinforcement learning-based multi-round visual reasoning framework that employs reference annotation rewards and global-local consistency rewards to enforce explicit citation of reference region coordinates in reasoning traces while maintaining semantic coherence. The approach achieves significant improvements in multi-round localization and segmentation accuracy on the newly constructed RegionDial-Bench.
Revisiting [CLS] and Patch Token Interaction in Vision Transformers: This paper systematically analyzes the interaction friction between the global [CLS] token and local patch tokens in Vision Transformers. It reveals that normalization layers implicitly differentiate between the two token types, and proposes specialized processing paths in normalization layers and early QKV projections. With only an 8% parameter increase, the method achieves over 2 mIoU improvement in segmentation while preserving classification accuracy.
Thicker and Quicker: A Jumbo Token for Fast Plain Vision Transformers: This paper proposes Jumbo: a method that expands the ViT CLS token to $J$ times its original width, splits it into $J$ patch-width tokens before attention, and reassembles them after attention for processing by a dedicated wide FFN. With negligible computational overhead, Jumbo substantially increases global modeling capacity, enabling plain ViT to surpass dedicated efficient architectures (EfficientViT, SHViT, MobileNetV4) in high-throughput inference regimes while preserving all architectural advantages of the plain ViT.
TRACE: Your Diffusion Model is Secretly an Instance Edge Detector: This work identifies an "Instance Emergence Point" (IEP) in the denoising trajectory of text-to-image diffusion models, at which self-attention exhibits sharp divergence changes at object boundaries. TRACE leverages IEP localization, ABDiv edge extraction, and single-step distillation to generate high-quality instance edges with an 81× inference speedup and no instance-level annotations. It improves unsupervised instance segmentation by 5.1 AP and, with only tag-level supervision, surpasses point-supervised panoptic segmentation by 1.7 PQ.
Universal Multi-Domain Translation via Diffusion Routers: This paper proposes the Diffusion Router (DR), which employs a single noise prediction network conditioned on source/target domain labels to handle all cross-domain mappings. It supports indirect translation via a center domain and direct non-center-domain translation based on a variational upper-bound objective combined with Tweedie refinement, achieving state-of-the-art performance on three large-scale UMDT benchmarks.
VINCIE: Unlocking In-context Image Editing from Video: VINCIE is the first framework to demonstrate that in-context image editing models can be learned entirely from native video data. By annotating videos as interleaved multimodal sequences and designing three proxy tasks (NIP/CSP/NSP), it achieves state-of-the-art performance on multi-turn editing benchmarks, improving the 5-turn editing success rate from less than 2% (baseline) to 25%.
VIRTUE: Visual-Interactive Text-Image Universal Embedder: This paper proposes VIRTUE, a visual-interactive universal embedder that integrates the segmentation model SAM2 with a VLM to support user-specified regions of interest via points, boxes, or masks, producing joint entity-level and global-level embeddings. A million-scale SCaR benchmark is introduced to evaluate visual-interactive retrieval, achieving SOTA on 36 MMEB tasks (+3.1%–8.5%) and 5 SCaR tasks (+15.2%–20.3%).
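
The split-and-reassemble step in the Jumbo entry above can be sketched as follows. Only the reshaping is shown; attention and the dedicated wide FFN are omitted. The function names and the toy sizes (J = 3, patch width 2) are hypothetical.

```python
from typing import List


def split_jumbo(jumbo: List[float], num_parts: int) -> List[List[float]]:
    """Split a wide CLS token of width J*d into J tokens of width d."""
    if len(jumbo) % num_parts != 0:
        raise ValueError("jumbo width must be divisible by num_parts")
    width = len(jumbo) // num_parts
    return [jumbo[i * width:(i + 1) * width] for i in range(num_parts)]


def merge_jumbo(parts: List[List[float]]) -> List[float]:
    """Concatenate the J patch-width tokens back into one wide token."""
    return [value for part in parts for value in part]


# Hypothetical sizes: J = 3 parts, patch width d = 2.
jumbo_token = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
parts = split_jumbo(jumbo_token, num_parts=3)
restored = merge_jumbo(parts)
```

Because each part has the same width as a patch token, the parts can pass through an unmodified attention layer alongside the patches before being merged again.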

👥 Social Computing

Adaptive Debiasing Tsallis Entropy for Test-Time Adaptation: This paper introduces Tsallis entropy (a generalization of Shannon entropy) into Test-Time Adaptation for vision-language models, and further develops Adaptive Debiasing Tsallis Entropy (ADTE), which customizes a per-class debiasing parameter $q^l$ to select more reliable high-confidence views than Shannon entropy without distribution-specific hyperparameter tuning. ADTE surpasses the state of the art on ImageNet and its 5 variants as well as 10 cross-domain benchmarks.
BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses: This paper presents BiasFreeBench, the first unified framework to systematically compare 8 mainstream debiasing methods (4 prompting + 4 training) at the response level for LLMs. It introduces the Bias-Free Score (BFS) metric and finds that prompting methods—particularly CoT—generally outperform training-based approaches, while DPO demonstrates superior cross-bias-type generalization.
Functional Embeddings Enable Aggregation of Multi-Area SEEG Data for Robust BCI: This paper proposes FunctionalMap, a framework that uses contrastive learning to learn subject-agnostic functional embeddings from intracranial local field potentials (LFPs) as a "functional coordinate system," replacing unreliable MNI anatomical coordinates. Combined with a Transformer, it enables cross-subject and cross-electrode aggregation of neural data and signal reconstruction, validated on a multi-area SEEG dataset from 20 subjects.
GRADIEND: Feature Learning within Neural Networks Exemplified through Biases: This paper proposes GRADIEND — a gradient-based encoder-decoder architecture that learns interpretable monosemantic features (exemplified by gender) from model gradients via a single bottleneck neuron. The framework not only identifies which weights encode a specific feature, but also directly modifies model weights through the decoder to mitigate bias. Combined with INLP, it achieves state-of-the-art debiasing results across all baseline models.
Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction: This work conducts the first speech-based Turing test on 9 state-of-the-art speech-to-speech (S2S) dialogue systems, collecting 2,968 human judgments. Results show that all systems fail the test (pass rates of 7%–31%). The primary bottlenecks lie not in semantic understanding but in paralinguistic features, emotional expression, and conversational persona. The study also introduces an 18-dimensional fine-grained evaluation framework and an interpretable AI judge model.
Propaganda AI: An Analysis of Semantic Divergence in Large Language Models: This paper proposes the RAVEN audit framework, which detects concept-conditioned semantic divergence in LLMs—a propaganda-like behavioral pattern wherein high-level conceptual cues (e.g., ideologies, public figures) trigger anomalously consistent stance responses—by combining intra-model semantic entropy with cross-model divergence analysis.
SAGE: Spatial-visual Adaptive Graph Exploration for Efficient Visual Place Recognition: This paper proposes SAGE, a unified VPR training framework that introduces a lightweight Soft Probing module to enhance local feature discriminability, reconstructs an affinity graph fusing geographic distance and visual similarity online at each epoch, and focuses on the hardest samples via greedy weighted clique expansion. With the DINOv2 backbone frozen and only 1.96M parameters trained, SAGE achieves state-of-the-art results across 8 benchmarks.
Scalable Multi-Task Low-Rank Model Adaptation: This paper systematically analyzes the root causes of multi-task LoRA collapse as the number of tasks scales (uniform regularization destroying shared knowledge + component-level LoRA amplifying gradient conflicts), and proposes mtLoRA: spectral-aware regularization + block-level adaptation + fine-grained routing. The method outperforms the state of the art by an average of 2.3% on 15–25 tasks, while reducing parameters by 47% and training time by 24%.
Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems: This paper proposes SupervisorAgent, a lightweight real-time adaptive supervision framework that actively intervenes at critical interaction nodes (error correction, guidance provision, observation purification) via an LLM-free adaptive filter, reducing token consumption of Smolagent on the GAIA benchmark by 29.68% without sacrificing success rate.
When Agents "Misremember" Collectively: Exploring the Mandela Effect in LLM-based Multi-Agent Systems: This paper presents the first systematic study of the Mandela effect (collective false memory) in LLM-based multi-agent systems. It introduces the ManBench benchmark (4,838 questions, 5 interaction protocols), demonstrates that all 13 evaluated LLMs are susceptible to this effect, and proposes prompt-level and model-level mitigation strategies that reduce false memory by 74.40% on average.

🎁 Recommender Systems¶

C2AL: Cohort-Contrastive Auxiliary Learning for Large-scale Recommendation Systems: This paper proposes C2AL (Cohort-Contrastive Auxiliary Learning), which identifies, in a data-driven manner, user cohort pairs with maximal distributional divergence and constructs contrastive auxiliary binary classification tasks to regularize the shared encoder. This transforms FM attention weights from sparse to dense, mitigating representation bias for minority cohorts in large-scale recommendation systems. The approach is validated on 6 Meta production models with billions of data points.
CollectiveKV: Decoupling and Sharing Collaborative Information in Sequential Recommendation: Observing that KV caches in sequential recommendation are highly similar across users (a collaborative signal), this paper proposes CollectiveKV, which decomposes KV into a low-dimensional user-specific component and a high-dimensional shared component retrieved from a global KV pool, achieving a compression ratio of 0.8% with no performance degradation.
From Evaluation to Defense: Advancing Safety in Video Large Language Models: This work constructs VideoSafetyEval (11.4k video-query pairs covering 19 risk categories), revealing that the video modality degrades safety performance by 34.2%, and proposes VideoSafety-R1, a three-stage framework (Alarm Token + SFT + Safety-guided GRPO) that improves defense success rate by 71.1% on VSE-HH.
GoalRank: Group-Relative Optimization for a Large Ranking Model: This paper theoretically proves that for any Multi-Generator-Evaluator (Multi-G-E) ranking system, there exists a larger generator-only model that approximates the optimal policy with smaller error and satisfies scaling laws. Based on this, GoalRank is proposed—a framework that uses a reward model to construct a group-relative reference policy for training a large generator-only ranking model, achieving significant improvements over SOTA in online A/B testing.
In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations: Through large-scale controlled experiments across 12 LLMs from 6 providers spanning three domains—news, academia, and e-commerce—this paper reveals that LLMs exhibit systematic latent source preferences: when content is semantically identical, merely swapping source labels significantly alters model selection behavior, and this preference cannot be eliminated through prompt engineering.
ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation: This paper proposes ProPerSim, a simulation framework that models daily behaviors of 32 user personas grounded in the Big Five personality model within the Smallville household environment. The AI assistant makes proactive recommendation decisions every 2.5 minutes and learns user preferences via DPO, improving user satisfaction from 2.2/4 to 3.3/4 over a 14-day simulation—providing the first empirical validation of jointly achieving proactivity and personalization.
RAE: A Neural Network Dimensionality Reduction Method for Nearest Neighbors Preservation in Vector Search: This paper proposes RAE (Regularized Auto-Encoder), a dimensionality reduction method based on a linear autoencoder with Frobenius norm regularization. The authors theoretically prove that the regularization coefficient $\lambda$ constrains the condition number $\kappa(W)$ of the encoder matrix via the Rayleigh quotient property, thereby bounding the norm distortion rate and preserving k-NN structure. RAE consistently outperforms PCA, UMAP, MDS, and ISOMAP on four datasets, achieving at least 12% higher k-NN preservation accuracy than PCA under cosine distance, with training requiring only 8 seconds and inference at millisecond latency.
Rejuvenating Cross-Entropy Loss in Knowledge Distillation for Recommender Systems: This paper theoretically demonstrates that CE loss maximizes a lower bound of NDCG in recommender system KD only when a closure assumption is satisfied—the candidate subset must contain the student's top-ranked items. However, the actual KD objective is to distill the ranking of the teacher's top items, and these two requirements conflict, explaining why vanilla CE performs poorly. Accordingly, the paper proposes RCE-KD: the teacher's top-K items are split into two groups based on whether they appear in the student's top-K, handled respectively by exact CE and sampling-approximated closure CE, with an adaptive fusion weight that evolves dynamically throughout training.
Search Arena: Analyzing Search-Augmented LLMs: This paper presents Search Arena — the first large-scale human preference dataset for search-augmented LLMs (24,069 conversations + 12,652 preference votes, 71 languages). Key findings include: user preference is positively influenced by citation quantity even when citations do not support the claims; community-driven platforms are preferred over Wikipedia; search augmentation does not degrade general chat performance, whereas general-purpose LLMs degrade significantly in search scenarios.
Token-Efficient Item Representation via Images for LLM Recommender Systems: This paper proposes I-LLMRec, which leverages item images in place of verbose textual descriptions to represent item semantics in recommender systems. Through a Recommendation-oriented Image-Semantic Alignment (RISA) module and a Recommendation-oriented Embedding Retrieval Inference (RERI) module, the method represents each item with a single token while preserving rich semantics, achieving approximately 2.93× inference speedup and surpassing text-description-based methods in recommendation performance.
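For the RCE-KD entry above (Rejuvenating Cross-Entropy Loss in Knowledge Distillation), the grouping step can be illustrated with a minimal sketch. It only shows how the teacher's top-K items might be partitioned by membership in the student's top-K. The function names, the score lists, and the tie-breaking by item index are assumptions made for illustration, and the two loss terms and the adaptive fusion weight are not reproduced here.

```python
def top_k(scores, k):
    """Return the indices of the k highest scores (ties broken by lower index)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


def split_teacher_top_k(teacher_scores, student_scores, k):
    """Partition the teacher's top-k items by whether the student also ranks them top-k."""
    teacher_top = top_k(teacher_scores, k)
    student_top = set(top_k(student_scores, k))
    shared = [i for i in teacher_top if i in student_top]
    teacher_only = [i for i in teacher_top if i not in student_top]
    return shared, teacher_only


# Toy example with 4 items; each group would then feed a different loss term.
shared, teacher_only = split_teacher_top_k(
    teacher_scores=[0.9, 0.8, 0.1, 0.7],
    student_scores=[0.2, 0.9, 0.8, 0.1],
    k=2,
)
```

As training progresses and the student's ranking approaches the teacher's, more items move into the shared group, which is one way to read why the summary describes a fusion weight that evolves during training.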

🧮 Scientific Computing¶

Astral: Training Physics-Informed Neural Networks with Error Majorants: This paper proposes the Astral loss function, based on a functional a posteriori error majorant, as a replacement for the conventional residual loss in training physics-informed neural networks (PINNs). The approach enables reliable error estimation throughout training and achieves superior or comparable accuracy across multiple PDE types, including diffusion and Maxwell equations.
Deep Learning for Subspace Regression: This paper formalizes the subspace prediction problem in Reduced Order Modeling (ROM) as a regression task on the Grassmann manifold. It proposes dedicated loss functions and a subspace embedding technique—predicting a higher-dimensional subspace containing the target—to reduce mapping complexity. The approach achieves significant improvements across eigenvalue problems, parametric PDEs, and iterative solver acceleration.
DGNet: Discrete Green Networks for Data-Efficient Learning of Spatiotemporal PDEs: Grounded in Green's function theory, DGNet embeds the superposition principle into a physics-neural hybrid architecture, achieving state-of-the-art accuracy with only tens of training trajectories and demonstrating robust zero-shot generalization to unseen source terms.
DRIFT-Net: A Spectral-Coupled Neural Operator for PDEs Learning: DRIFT-Net is a dual-branch neural operator that addresses autoregressive drift caused by insufficient global spectral coupling in window attention, via controlled low-frequency mixing (spectral branch), local detail fidelity (image branch), and bandwidth fusion through radial gating. It reduces error by 7%–54% on Navier-Stokes benchmarks.
Empirical Stability Analysis of Kolmogorov-Arnold Networks in Hard-Constrained Recurrent Physics-Informed Discovery: This paper systematically evaluates vanilla KAN as a drop-in replacement for the MLP in the residual branch of Hard-Constrained Recurrent Physics-Informed Neural Networks (HRPINN). Across 3 complementary studies × 100 random seeds, it finds that KAN is competitive on univariate separable residuals (Duffing's $-0.3x^3$) but systematically fails on multiplicatively coupled residuals (Van der Pol's $(1-x^2)v$), where it shows extreme hyperparameter fragility, while a standard MLP is substantially more stable across nearly all configurations.
HyperKKL: Enabling Non-Autonomous State Estimation through Dynamic Weight Conditioning: This paper proposes HyperKKL, which uses a hypernetwork to encode exogenous input signals and dynamically generate the transformation mapping parameters of a KKL observer, enabling state estimation for non-autonomous nonlinear systems without retraining or online gradient updates. The method is validated on four classical nonlinear systems: Duffing, Van der Pol, Lorenz, and Rössler.
Learning-guided Kansa Collocation for Forward and Inverse PDE Problems: This work extends the meshfree radial basis function (RBF)-based Kansa collocation method from single-variable linear PDEs to coupled multi-variable and nonlinear PDE settings. It incorporates automatic shape-parameter tuning and multiple time-stepping schemes, and provides a systematic comparison against neural PDE solvers such as PINNs and FNO on both forward and inverse problems.
One Operator to Rule Them All? On Boundary-Indexed Operator Families in Neural PDE Solvers: This paper argues that neural PDE solvers, when trained under varying boundary conditions, do not learn a single solution operator but rather a family of operators indexed by boundary conditions. It formalizes the non-identifiability problem induced by boundary distribution shift under ERM from a learning-theoretic perspective.
Policy Myopia as a Mechanism of Gradual Disempowerment in Post-AGI Governance: This paper argues that policy myopia is not an attention-allocation problem but an institutional mechanism that systematically and irreversibly strips humans of governance participation capacity in the post-AGI era — through three coupled positive feedback loops: salience capture, capability cascades, and value lock-in. Standard mitigation measures can only delay but not prevent this process.
Supervised Metric Regularization Through Alternating Optimization for Multi-Regime PINNs: This paper proposes a Topology-Aware PINN (TAPINN) that structures the latent space via supervised metric regularization (Triplet Loss) and stabilizes training through an alternating optimization schedule. On the multi-regime Duffing oscillator benchmark, TAPINN reduces physics residuals by approximately 49% (0.082 vs. 0.160) and gradient variance by 2.18× compared to baselines.
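The TAPINN entry above relies on a triplet loss to structure the latent space. As a reminder of what that loss computes, here is a minimal sketch of the standard margin-based triplet loss on plain Python lists. The Euclidean distance, the margin value, and the function names are assumptions for illustration, and the paper's exact formulation and its coupling to the physics residual are not shown.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the positive closer than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Latent codes from the same regime (positive) should sit closer to the anchor
# than codes from a different regime (negative).
well_separated = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
violating = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

The loss is zero once the negative is at least `margin` farther away than the positive, so it only generates gradients for triplets that still violate the desired latent geometry.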

✏️ Knowledge Editing¶

Bilinear Representation Mitigates Reversal Curse and Enables Consistent Model Editing: By training Transformers from scratch on a synthetic relational knowledge graph, this work demonstrates that appropriate regularization induces the emergence of bilinear relational structure in hidden representations. This structure not only overcomes the reversal curse but also enables logically consistent propagation of edits to related facts.
EAMET: Robust Massive Model Editing via Embedding Alignment Optimization: This paper identifies the root cause of large-scale model editing failures as structural inconsistency (embedding misalignment) between key embeddings and residual embeddings, and proposes EAMET, which progressively saves optimized residual embeddings and aligns their neighborhood structure to the key embedding space via a dual KL divergence + MSE loss. EAMET outperforms MEMIT by an average of 14% (CounterFact) and 8% (ZsRE) when simultaneously editing 10k facts across 6 LLMs and 3 datasets, while remaining robust in two challenging scenarios: long-prefix inputs and multi-fact editing under the same subject.
Energy-Regularized Sequential Model Editing on Hyperspheres: This paper interprets performance degradation in sequential model editing through the lens of hyperspherical uniformity (Hyperspherical Energy, HE), and proposes SPHERE: by projecting editing perturbations onto the orthogonal complement of the principal hyperspherical directions of pre-trained weights, SPHERE enables stable large-scale sequential editing, outperforming the strongest baseline by an average of 16.41% on LLaMA3-8B.
Fine-tuning Done Right in Model Editing: This paper reveals that the underestimation of fine-tuning in model editing stems from an incorrect training pipeline (depth-first, sample-by-sample optimization). By correcting it to standard breadth-first mini-batch training and combining it with localized parameter updates, the proposed LocFT-BF achieves, for the first time, support for 100K sequential edits and models up to 72B parameters.
GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing: GOT-Edit integrates 3D geometric information from VGGT into a 2D generic object tracker via null-space-constrained online model editing, enhancing geometric awareness while preserving semantic discriminability, and achieving significant tracking improvements in occlusion and cluttered-background scenarios.
PICS: Pairwise Image Compositing with Spatial Interactions: This paper proposes PICS—a parallel pairwise image compositing method that simultaneously composes two objects in a single inference pass via mask-guided MoE and adaptive α-blending within an Interaction Transformer, explicitly modeling spatial interactions such as occlusion and contact, and consistently outperforming existing sequential compositing methods.
Rote Learning Considered Useful: Generalizing over Memorized Data in LLMs: This paper proposes a "memorize-then-generalize" framework that employs a two-stage strategy—first memorizing factual associations via semantics-free synthetic tokens through rote learning, then fine-tuning with a small number of semantic prompts—to demonstrate that LLMs can generalize from rote-memorized data. Deeper memorization yields better generalization, and the paper further identifies security risks arising from potential malicious exploitation of this mechanism.
Rote Learning Considered Useful: Generalizing over Memorized Training Examples: This paper proposes a two-stage "memorize-then-generalize" framework, demonstrating that LLMs can generalize effectively after rote-memorizing synthetic key tokens, requiring only minimal semantic fine-tuning — thereby challenging the conventional view that memorization impedes generalization.
When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations: This paper proposes the EVOKE benchmark to systematically evaluate the ability of Large Multimodal Models (LMMs) to incorporate evolving knowledge, identifies two core challenges (poor performance of existing methods and catastrophic forgetting induced by fine-tuning), and explores two mitigation strategies: knowledge augmentation and continual learning.
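The SPHERE entry above describes projecting editing perturbations onto the orthogonal complement of certain principal directions. The core linear-algebra step can be sketched on a single vector as follows. This is a simplification under stated assumptions: the directions are taken to be given and orthonormal, the function name is hypothetical, and how SPHERE selects its principal hyperspherical directions from pre-trained weights is not reproduced.

```python
def project_out(delta, directions):
    """Remove from `delta` its components along each orthonormal direction."""
    result = list(delta)
    for u in directions:
        coeff = sum(d * ui for d, ui in zip(result, u))
        result = [d - coeff * ui for d, ui in zip(result, u)]
    return result


# Assumed orthonormal directions to protect (here: the first two coordinate axes).
protected = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
projected = project_out([1.0, 2.0, 3.0], protected)
```

After the projection, the update has zero inner product with every protected direction, so applying it cannot move the weights along those directions.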

🎯 Object Detection¶

AdaRank: Adaptive Rank Pruning for Enhanced Model Merging: AdaRank adaptively selects singular components of task vectors via learnable binary masks (replacing heuristic top-k selection) and combines this with test-time entropy minimization, substantially alleviating inter-task interference in multi-task model merging and achieving 89.4% accuracy on ViT-B/32.
CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detection: This paper is the first to introduce Object-Centric Learning (Slot Attention) into Source-Free Domain-Adaptive Object Detection (SF-DAOD). It extracts domain-invariant object-level structural priors via a hierarchical slot-aware module and drives domain-invariant representations through class-guided contrastive learning, achieving substantial improvements over existing methods across multiple cross-domain benchmarks.
CORDS: Continuous Representations of Discrete Structures: CORDS is a framework that bijectively maps variable-size discrete sets (detection boxes, molecular atoms) to continuous density and feature fields, enabling models to learn in field space and decode back to discrete sets exactly — without the constraints of fixed slots or padding.
ForestPersons: A Large-Scale Dataset for Under-Canopy Missing Person Detection: ForestPersons is the first large-scale benchmark dataset specifically designed for under-canopy missing person detection in forest environments (96,482 images + 204,078 annotations). By simulating the low-altitude flight perspective of micro aerial vehicles (MAVs) at 1.5–2.0 meters, the dataset covers multi-season, multi-weather, multi-pose, and multi-occlusion-level conditions representative of real search-and-rescue (SAR) scenarios, providing a solid foundation for training and evaluating under-canopy person detection models.
FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion: This paper proposes a training-free few-shot object detection framework that combines three foundation models—UPN, SAM2, and DINOv2—for proposal generation and feature matching, and employs a graph diffusion algorithm to refine confidence scores and suppress fragmented proposals. The method achieves substantial improvements over prior state-of-the-art on Pascal-5i and COCO-20i.
InfoDet: A Dataset for Infographic Element Detection: This paper introduces a large-scale infographic element detection dataset (101,264 infographics, 14.2 million annotations) spanning two major categories—chart elements and human-recognizable objects (HROs)—and proposes a Grounded CoT method that leverages detection results to enhance VLM chart understanding.
Long-Context Generalization with Sparse Attention: This paper proposes ASEntmax (Adaptive-Scalable Entmax), which replaces softmax attention with α-entmax equipped with a learnable temperature. Through both theoretical analysis and empirical evaluation, it demonstrates that sparse attention enables up to 1000× length extrapolation, addressing the attention dispersion problem of softmax under long-context settings.
SPWOOD: Sparse Partial Weakly-Supervised Oriented Object Detection: This paper proposes the SPWOOD framework to jointly address sparse and weak annotation (HBox/Point) in oriented object detection. Through a Self-Adaptive Oriented Detector (SAOD) and a spatial layout learning strategy, SPWOOD achieves near-fully-supervised performance on the DOTA benchmark under a mixed annotation setting (RBox:HBox:Point = 1:1:1).
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Method: This paper proposes TreeBench (the first traceable visual reasoning benchmark comprising 405 highly challenging VQA pairs, on which OpenAI-o3 achieves only 54.87%) and TreeVGR (a training paradigm that jointly supervises grounding and reasoning via dual IoU reward-based reinforcement learning). A 7B model achieves gains of +16.8 on V*Bench, +12.6 on MME-RealWorld, and +13.4 on TreeBench, demonstrating that traceability is a key driver of visual reasoning advancement.
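The ASEntmax entry above (Long-Context Generalization with Sparse Attention) replaces softmax with α-entmax. To show how such a mapping can assign exactly zero weight to some tokens, here is a minimal sketch of sparsemax, which is the α = 2 member of the entmax family (α = 1 recovers softmax). The fixed `temperature` argument is only a stand-in for illustration: ASEntmax learns its temperature and is not limited to α = 2, and neither aspect is modelled here.

```python
def sparsemax(scores, temperature=1.0):
    """Sparsemax (alpha = 2 entmax): Euclidean projection of scores onto the simplex."""
    z = [s / temperature for s in scores]
    z_sorted = sorted(z, reverse=True)
    cumulative = 0.0
    support_size = 0
    support_sum = 0.0
    # Find the largest j such that 1 + j * z_(j) > sum of the top-j sorted scores.
    for j, value in enumerate(z_sorted, start=1):
        cumulative += value
        if 1.0 + j * value > cumulative:
            support_size = j
            support_sum = cumulative
    tau = (support_sum - 1.0) / support_size  # threshold subtracted from every score
    return [max(v - tau, 0.0) for v in z]


sharp = sparsemax([1.0, 0.0, -1.0])                     # low temperature: sparse output
smooth = sparsemax([1.0, 0.0, -1.0], temperature=10.0)  # high temperature: dense output
```

Unlike softmax, whose outputs are always strictly positive, the thresholding step lets low-scoring positions drop out of the distribution entirely, which is the property the summary connects to reduced attention dispersion at long context lengths.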

🧑 Human Understanding¶

BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behaviour Analysis: This paper introduces BAH, the first multimodal dataset for Ambivalence/Hesitancy (A/H) recognition in videos, comprising 1,118 video clips (8.26 hours total) from 224 participants across 9 Canadian provinces, annotated by behavioural science experts, with frame-level and video-level baseline experimental results provided.
Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics: This paper proposes the Q Avatar framework, which quantifies the transferability of source-domain models via cross-domain Bellman consistency and combines source- and target-domain Q-functions through an adaptive, hyperparameter-free weighting function. The framework enables reliable knowledge transfer in cross-domain RL with mismatched state-action spaces, guaranteeing no negative transfer regardless of source model quality or domain similarity.
GaitSnippet: Gait Recognition Beyond Unordered Sets and Ordered Sequences: This paper proposes a Snippet paradigm that organizes gait silhouette sequences into several "snippets," each formed by randomly sampling frames from a contiguous interval. This design captures both short-range temporal context and long-range temporal dependencies, achieving 77.5% Rank-1 on Gait3D with a 2D convolution backbone, surpassing all 3D convolution methods.
Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals: This paper proposes TEMU-VTOFF, a Dual-DiT architecture for the Virtual Try-Off (VTOFF) task. A feature extractor and a garment generator collaborate in a division-of-labor design; Multimodal Hybrid Attention (MHA) fuses image, text, and mask signals to resolve visual ambiguity; and a DINOv2-driven garment aligner preserves high-frequency details. The method achieves state-of-the-art performance on both VITON-HD and the multi-category Dress Code benchmark.
NeuroGaze-Distill: Brain-informed Distillation and Depression-Inspired Geometric Priors for Robust Facial Emotion Recognition: This paper proposes NeuroGaze-Distill, a cross-modal distillation framework that extracts static Valence-Arousal prototypes from an EEG-trained teacher model and injects them into a purely visual student model via Proto-KD and depression-inspired geometric priors (D-Geo), improving cross-dataset robustness for facial expression recognition without requiring paired EEG-face data.
PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits: This work constructs the PersonaX multimodal dataset (comprising LLM-inferred Big Five behavior traits, facial embeddings, and biographical metadata) and proposes a two-level analysis framework: structured independence testing combined with unstructured causal representation learning (with theoretical identifiability guarantees), revealing cross-modal causal structures.
QuaMo: Quaternion Motions for Vision-based 3D Human Kinematics Capture: QuaMo is a 3D human kinematics capture method based on quaternion differential equations (QDE). By solving kinematic equations under the unit quaternion sphere constraint $\mathcal{S}^3$ and introducing a second-order acceleration-augmented meta-PD controller, the method achieves discontinuity-free, low-jitter online real-time human motion estimation, surpassing state-of-the-art methods on Human3.6M and several other benchmarks.
Visual Autoregressive Modeling for Instruction-Guided Image Editing: VAREdit reformulates instruction-guided image editing as a next-scale prediction problem. It proposes the Scale-Aligned Reference (SAR) module to resolve the scale mismatch between finest-scale conditioning and coarse target features. On EMU-Edit and PIE-Bench, the GPT-Balance score surpasses the strongest diffusion baseline by 64.9% and 45.3%, respectively, with 512×512 editing completed in only 1.2 seconds.
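The GaitSnippet entry above describes snippets formed by randomly sampling frames from contiguous intervals. One plausible reading of that sampling scheme is sketched below. The equal-length split of the sequence, the parameter names, and the decision to sort the sampled frame indices are all assumptions for illustration, and the paper's actual sampling and aggregation details may differ.

```python
import random


def sample_snippets(num_frames, num_snippets, frames_per_snippet, rng=None):
    """Split a sequence into contiguous intervals and sample frame indices from each."""
    rng = rng or random.Random()
    interval = num_frames // num_snippets
    snippets = []
    for s in range(num_snippets):
        start = s * interval
        # The last interval absorbs any leftover frames.
        end = num_frames if s == num_snippets - 1 else start + interval
        picked = sorted(rng.sample(range(start, end), frames_per_snippet))
        snippets.append(picked)
    return snippets


# A 30-frame silhouette sequence split into 3 snippets of 4 frames each.
snippets = sample_snippets(30, 3, 4, rng=random.Random(0))
```

Frames within a snippet stay close in time (short-range context), while the set of snippets spans the whole sequence (long-range dependencies), matching the two properties named in the summary.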

📡 Signal & Communications¶

Deterministic Bounds and Random Estimates of Metric Tensors on Neuromanifolds: By analyzing the spectral properties of the Fisher Information Matrix (FIM) in the low-dimensional kernel space of probability distributions, this paper establishes deterministic upper and lower bounds for the metric tensor on the neural network parameter space (neuromanifold), and introduces a family of unbiased stochastic estimators with bounded variance based on the Hutchinson trace estimator, computable efficiently with a single backward pass.
FASA: Frequency-Aware Sparse Attention: This paper identifies functional sparsity at the frequency component (FC) level within RoPE—a small subset of "dominant FCs" can effectively predict token importance. Based on this finding, the paper proposes the FASA framework, which achieves training-free KV cache compression via two stages: dominant-FC-based token importance prediction and focused attention computation. On LongBench, retaining only 256 tokens approaches 100% of full-KV performance; on AIME24, FASA achieves a 2.56× speedup using only 18.9% of the KV cache.
Group Representational Position Encoding (GRAPE): This paper proposes the GRAPE framework, which unifies the multiplicative (RoPE) and additive (ALiBi/FoX) families of positional encodings in Transformers via group actions, proves that RoPE and ALiBi are exact special cases, and introduces a path-integral additive variant GRAPE-AP that outperforms existing methods on downstream tasks.
Learning Molecular Chirality via Chiral Determinant Kernels: This paper proposes Chiral Determinant Kernels (ChiDeK) to encode SE(3)-invariant chiral matrices, achieving for the first time a unified treatment of both central and axial chirality within a GNN framework. Combined with cross-attention for propagating stereochemical information, the method achieves >7% accuracy improvement on a newly constructed axial chirality benchmark.
Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies: This paper proposes Multi-Agent System Search (MASS), a framework that automatically discovers high-performance multi-agent system (MAS) designs through a three-stage interleaved strategy of prompt and topology optimization: local prompt optimization → topology search → global prompt optimization.
Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional: Through a large-scale empirical study, this work quantifies intra-modal and inter-modal dependencies across 23 VQA benchmarks, revealing that most benchmarks contain severe unimodal shortcuts and that eliminating text bias tends to introduce image bias. A quantitative evaluation framework for multimodal benchmark design is proposed.
Robust Preference Alignment via Directional Neighborhood Consensus: This paper proposes Robust Preference Selection (RPS), a training-free inference-time method for improving preference alignment robustness. By sampling multiple candidate directions from the local neighborhood of a target preference and generating responses accordingly, then selecting the best response according to the original target preference, RPS achieves up to 69% win rate over baselines on OOD preferences.
Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability: This paper proposes Spectrum Tuning, a post-training method that trains language models on a distributional-fitting dataset spanning 90+ tasks, improving in-context steerability, output space coverage, and distributional alignment. It reveals that current instruction tuning systematically degrades in-context steerability.
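The neuromanifold entry at the top of this section builds its stochastic estimators on the Hutchinson trace estimator. For readers unfamiliar with it, the generic estimator is sketched below: it approximates the trace of a matrix using only matrix-vector products with random ±1 (Rademacher) probe vectors. This is the textbook estimator, not the paper's FIM-specific construction, and the `matvec` callback and parameter names are assumptions for illustration.

```python
import random


def hutchinson_trace(matvec, dim, num_samples=100, rng=None):
    """Estimate trace(A) as the average of v^T A v over random Rademacher vectors v."""
    rng = rng or random.Random()
    total = 0.0
    for _ in range(num_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        av = matvec(v)  # only a matrix-vector product is needed, never A itself
        total += sum(vi * avi for vi, avi in zip(v, av))
    return total / num_samples


# Toy check on a diagonal matrix, where every probe returns the exact trace.
diagonal = [2.0, 3.0, 5.0]
estimate = hutchinson_trace(
    lambda v: [d * x for d, x in zip(diagonal, v)], dim=3, rng=random.Random(0)
)
```

The estimator is unbiased because the probe entries are independent with zero mean and unit variance, so off-diagonal terms cancel in expectation. In a neural-network setting the matrix-vector product would typically come from automatic differentiation instead of an explicit matrix.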

🌐 Multilingual & Translation¶

ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity: This paper proposes the ASSESS framework, whose core contribution is the TransTED Similarity metric for evaluating autoformalization statement similarity. By parsing formal mathematical statements into Operator Trees (OPTs) and augmenting standard Tree Edit Distance (TED) with semantic transformations driven by Lean proof tactics, ASSESS achieves state-of-the-art performance of 70.16% accuracy and a Kappa score of 0.35 on the EPLA benchmark, while requiring only CPU resources for reproduction. The paper also releases the EPLA benchmark, comprising 1,247 expert-annotated statement pairs.
ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality: This paper proposes the Adaptive Transfer Scaling Law (ATLAS), which decomposes effective data volume into three components—target language, transfer languages, and other languages—and introduces a data repetition saturation function. Evaluated across 774 multilingual training experiments (10M–8B parameters, 400+ languages), ATLAS substantially outperforms existing scaling laws, improving multilingual $R^2$ from 0.67 to 0.98, and systematically quantifies the cross-lingual transfer matrix, capacity constraints underlying the curse of multilinguality, and the computational crossover point between pretraining and finetuning.
Multilingual Routing in Mixture-of-Experts: This paper systematically analyzes multilingual routing patterns in MoE large language models, finding that middle layers contain cross-lingually shared experts and that language performance is strongly correlated with alignment to English routing. Based on these findings, the authors propose an inference-time routing intervention that activates English task experts in middle layers, consistently improving multilingual performance by 1–2% across 3 models × 2 tasks × 15+ languages.
Prior-based Noisy Text Data Filtering: Fast and Strong Alternative for Perplexity: This paper proposes a text data filtering method based on token priors (token term-frequency statistics). It detects anomalous documents by computing the mean and standard deviation of token priors within each document, using them as a proxy for perplexity (PPL). The method achieves the highest average performance across 20 downstream benchmarks while being over 1000× faster than PPL-based filtering.
SASFT: Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs: This paper uses Sparse Autoencoders (SAEs) to identify that unexpected code-switching in LLMs is associated with abnormally high pre-activation values of target-language features, and proposes SASFT, a method that constrains language feature pre-activation values during SFT training, reducing unexpected code-switching by over 50%.
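The prior-based filtering entry above scores documents with the mean and standard deviation of their token priors. A minimal sketch of that statistic is shown below. Whitespace-style pre-tokenized documents, a prior of 0.0 for unseen tokens, and the function names are assumptions for illustration, and the thresholds the paper uses to decide which documents to drop are not reproduced.

```python
import math
from collections import Counter


def token_priors(corpus):
    """Estimate each token's prior as its relative frequency over the whole corpus."""
    counts = Counter(tok for doc in corpus for tok in doc)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def doc_stats(doc, priors):
    """Mean and (population) standard deviation of the token priors in one document."""
    values = [priors.get(tok, 0.0) for tok in doc]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["zxq", "qqz", "vvk"],  # gibberish document made of rare tokens
]
priors = token_priors(corpus)
stats = [doc_stats(doc, priors) for doc in corpus]
```

Computing these statistics needs only a frequency table and one pass over each document, with no model forward pass, which is consistent with the large speedup over perplexity-based filtering that the summary reports.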

🔎 AIGC Detection

Calibrating Verbalized Confidence with Self-Generated Distractors: This paper proposes DiNCo, a method that exposes the "suggestibility bias" of LLMs by having them independently evaluate automatically generated distractor options (plausible but incorrect alternative answers). It normalizes confidence using the total confidence assigned to distractors, and integrates two complementary dimensions—generation consistency and verification consistency—to significantly improve confidence calibration on both short-form QA and long-form generation tasks.
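As a rough illustration of normalising a verbalized confidence by the confidence given to distractors, the sketch below divides the answer's confidence by the total confidence over the answer and its distractors. The exact formula is an assumption made for this example, and `normalized_confidence` is a hypothetical helper, not DiNCo's actual implementation.

```python
def normalized_confidence(answer_conf, distractor_confs):
    """Assumed normalisation: answer confidence over total confidence mass."""
    total = answer_conf + sum(distractor_confs)
    if total == 0:
        return 0.0
    return answer_conf / total

# A suggestible model rates the answer at 0.9 but also rates two incorrect
# distractors at 0.8 and 0.7 when each is judged independently.
calibrated = normalized_confidence(0.9, [0.8, 0.7])

# A model that rejects the distractors keeps a high normalised confidence.
robust = normalized_confidence(0.9, [0.05, 0.05])
```

Under this assumed rule, a high raw confidence is discounted whenever the model is similarly confident in mutually exclusive alternatives.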
CLARC: C/C++ Benchmark for Robust Code Search: This paper introduces CLARC, the first compilable C/C++ code retrieval benchmark comprising 6,717 query–code pairs. An automated pipeline extracts code from GitHub and employs LLMs combined with hypothesis testing to generate and validate queries. The benchmark covers four retrieval scenarios—standard, anonymized, assembly, and WebAssembly—and reveals that existing code embedding models over-rely on lexical features (NDCG@10 drops from 0.89 to 0.67 after anonymization) and perform poorly on binary-level retrieval.
Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity: Through close reading annotations of 8,618 expressions by 26 professional writers, this paper demonstrates that n-gram novelty is insufficient for measuring textual creativity — approximately 91% of expressions with high n-gram novelty are not perceived as creative, and in open-source LLMs, high n-gram novelty negatively correlates with pragmaticality.
DMAP: A Distribution Map for Text: This paper proposes DMAP (Distribution Map), a mathematical framework that maps text to i.i.d. samples on $[0,1]$ via next-token probability rankings from a language model. A formal theorem proves that purely sampled text yields a uniform distribution, enabling $\chi^2$-based verification of generation parameters, exposing the root cause of the complete failure of probability-curvature detectors under pure sampling, and visualizing statistical fingerprints left by post-training (SFT/RLHF) in downstream models.
Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review: This paper constructs the largest AI-generated peer review dataset to date (788,984 reviews), systematically evaluates 18 AI text detection methods in the peer review setting, and proposes the Anchor detection method that leverages the source paper as contextual grounding, substantially outperforming all baselines at low false positive rates.
PoliCon: Evaluating LLMs on Achieving Diverse Political Consensus Objectives: The PoliCon benchmark is constructed from 2,225 high-quality deliberation records spanning 13 years (2009–2022) of the European Parliament. By designing diverse voting mechanisms (simple majority / two-thirds majority / veto power), power structures, and political objectives (utilitarianism / Rawlsianism), the benchmark systematically evaluates the ability of LLMs to draft political consensus resolutions, revealing the shortcomings of frontier models on complex consensus tasks and their inherent partisan biases.

🛰️ Remote Sensing

AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild: This paper proposes AutoFly, an end-to-end VLA model for UAV autonomous navigation in the wild. It infers spatial information from RGB inputs via a pseudo-depth encoder, and is trained on a newly constructed autonomous navigation dataset (13K+ trajectories including 1K real flights). AutoFly achieves a 3.9% higher success rate and a 2.6% lower collision rate than OpenVLA in both simulated and real environments.
Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents: Earth-Agent is the first Earth observation agent framework built upon an MCP-based tool ecosystem. It unifies RGB and spectral remote sensing data, dynamically invoking 104 expert tools to enable cross-modal, multi-step, and quantitative spatiotemporal reasoning. The accompanying Earth-Bench benchmark comprises 248 expert-curated tasks and 13,729 images. Experiments demonstrate that Earth-Agent substantially outperforms both general-purpose agents and remote sensing MLLMs.
Measuring the Intrinsic Dimension of Earth Representations: This work presents the first systematic measurement of the intrinsic dimension (ID) of Geographic Implicit Neural Representations (Geographic INR), finding that 256–512-dimensional embeddings have true IDs of only 2–10. Higher ID in frozen embedding spaces correlates positively with downstream performance, while lower ID in supervised task-head activation spaces correlates positively with performance, revealing a dual mechanism of "representativeness vs. task alignment."
Spectral Gaps and Spatial Priors: Studying Hyperspectral Downstream Adaptation Using TerraMind: This paper investigates whether TerraMind, a multimodal geospatial foundation model not pretrained on hyperspectral data, can be effectively adapted to hyperspectral downstream tasks via channel adaptation strategies (naive band selection vs. SRF-based grouping). Results demonstrate that naive band selection consistently outperforms the physically-informed SRF approach, with the performance gap widening as the spectral complexity of the task increases.
TAMMs: Change Understanding and Forecasting in Satellite Image Time Series with Temporal-Aware Multimodal Models: TAMMs is proposed as the first unified framework that jointly performs Temporal Change Description (TCD) and Future Satellite Image Forecasting (FSIF) within a single MLLM-diffusion architecture. A Temporal Adaptation Module (TAM) awakens the temporal reasoning capability of a frozen MLLM, while a Semantic Fusion Control Injection (SFCI) mechanism converts change understanding into generative control signals.
Task-free Adaptive Meta Black-box Optimization: This paper proposes ABOM—a task-free adaptive meta black-box optimizer that eliminates the need for predefined training task distributions. By parameterizing evolutionary operators (selection, crossover, mutation) as differentiable attention modules and leveraging self-generated data for online parameter updates during optimization, ABOM achieves competitive zero-shot performance on synthetic benchmarks and UAV path planning tasks.

🗣️ Dialogue Systems

AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions: This paper proposes AQuA, the first visual question answering dataset with fine-grained ambiguity grading across four levels (7.2K samples, 1.8K per level), defining an optimal response strategy for each level (direct answer / inference / enumeration / clarification request). The study finds that GPT-5 and Gemini over-confidently default to direct answers on ambiguous VQA instances, while a 3B model trained via SFT+GRPO surpasses closed-source large models in strategy adaptation.
Non-Collaborative User Simulators for Tool Agents: Drawing on four categories of non-collaborative user behavior from marketing research (unavailable service requests, tangential chit-chat, impatience, and incomplete utterances), this work constructs a goal-aligned simulation framework and systematically exposes the behavior-specific failure mechanisms of state-of-the-art tool agents on MultiWOZ and τ-bench. Tangential chit-chat causes an average success rate (SR) drop of 29.1%, and distinct model families exhibit qualitatively different failure modes—GPT-series models fall into repetitive helper API calls, while Qwen-series models tend to hallucinate API results.
ReIn: Conversational Error Recovery with Reasoning Inception: This paper proposes Reasoning Inception (ReIn), a test-time intervention method that requires no modification to model parameters or system prompts. An external inception module detects conversational errors and injects recovery plans into the task agent's reasoning chain, significantly improving task completion rates across diverse error scenarios while generalizing to unseen error types.
Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation: FlyThinker proposes an efficient "think-while-generating" framework that employs a dedicated reasoning model (Reasoner) to generate latent reasoning signals in parallel at the token level, dynamically incorporating them into a generation model (Generator) to guide personalized long-form generation, while preserving both training and inference efficiency.
Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding: By contrasting layer-wise hidden representations (chain-of-embedding) with and without visual input, this paper identifies a "Visual Integration Point" (VIP) layer in LVLMs and proposes the Total Visual Integration (TVI) metric to quantify the strength of language priors.

⚛️ Physics

Feedback-driven Recurrent Quantum Neural Network Universality: This paper establishes the first quantitative approximation error bounds and universality proofs for feedback-based recurrent quantum neural networks (RQNNs), demonstrating that RQNNs can approximate arbitrary fading memory filters with a linear readout layer while requiring only $\lceil\log_2(\varepsilon^{-1})\rceil$ qubits — growing logarithmically with precision — and are thus free from the curse of dimensionality.
Sublinear Time Quantum Algorithm for Attention Approximation: This paper proposes the first quantum data structure with sublinear time complexity in sequence length $n$ for approximating row queries of the Transformer attention matrix. The preprocessing time is $\widetilde{O}(\epsilon^{-1} n^{0.5} \cdot \text{poly}(d, s_\lambda, \alpha))$ and each row query takes $\widetilde{O}(s_\lambda^2 + s_\lambda d)$, achieving a quadratic speedup over classical algorithms with respect to $n$.

🧠 Mixture of Experts

MoE-GS: Mixture of Experts for Dynamic Gaussian Splatting: This paper proposes MoE-GS, the first framework to introduce a Mixture-of-Experts architecture into dynamic Gaussian Splatting. Through a Volume-aware Pixel Router, it adaptively fuses multiple heterogeneous deformation priors (HexPlane / per-Gaussian / polynomial / interpolation), consistently surpassing state-of-the-art methods on the N3V and Technicolor datasets, while maintaining efficiency via single-pass rendering, gate-aware pruning, and knowledge distillation.

📂 Others

A Federated Generalized Expectation-Maximization Algorithm for Mixture Models with an Unknown Number of Components: This paper proposes FedGEM, an algorithm in which clients perform local EM steps and construct uncertainty sets, while the server detects cluster overlap via set intersections and infers the global number of clusters. FedGEM is the first method to achieve federated clustering with an unknown global cluster count $K$, and comes with probabilistic convergence guarantees.
A Representer Theorem for Hawkes Processes via Penalized Least Squares Minimization: This paper establishes a novel representer theorem for estimating triggering kernels of linear multivariate Hawkes processes within the RKHS framework, proving that the optimal estimator admits a finite representation as a linear combination of equivalent kernels evaluated at data points, with all dual coefficients analytically equal to 1. This eliminates the need to solve a dual optimization problem, enabling efficient and scalable nonparametric estimation.
A Scalable Inter-edge Correlation Modeling in CopulaGNN for Link Sign Prediction: This paper extends CopulaGNN from the node level to the edge level for link sign prediction on signed graphs. By constructing the correlation matrix as the Gramian of edge embeddings and reformulating the conditional distribution via the Woodbury identity, the proposed method achieves scalable modeling of inter-edge statistical dependencies.
A Single Architecture for Representing Invariance Under Any Space Group: A single architecture (Crystal Fourier Transformer) is proposed that adapts to the invariance requirements of any space group. By analytically deriving the constraints imposed by group operations on Fourier coefficients, the method constructs symmetry-adapted Fourier bases and achieves parameter sharing and zero-shot generalization across all 230 space groups via a dual graph representation of these constraints.
Active Learning for Decision Trees with Provable Guarantees: This paper establishes the first theoretical guarantees for active learning with decision trees: (1) the first explicit analysis of the disagreement coefficient for decision trees, yielding an $O(\ln^{OPT}(n))$ upper bound; (2) the first active learning algorithm for binary classification achieving a multiplicative error guarantee of $(1+\epsilon)$. Combining these two results yields polylogarithmic label complexity in the dataset size.
Addressing Divergent Representations from Causal Interventions on Neural Networks: This paper systematically demonstrates that causal interventions (activation patching, DAS, SAEs, etc.) push model internal representations off their natural distribution. It theoretically distinguishes "benign shifts" from "harmful shifts," proposes the Counterfactual Latent (CL) loss to constrain intervened representations to remain near the natural manifold, and validates on a 7B LLM that this approach reduces divergence while preserving intervention accuracy.
Agnostics: Learning to Synthesize Code in Any Programming Language with a Universal RL Environment: This paper proposes Agnostics, a language-agnostic post-training pipeline that reformulates programming tasks as I/O behavioral specifications and trains LLMs with a universal verifier and GRPO-based reinforcement learning to generate code in any programming language. The approach enables a Qwen 4B model to match the performance of 16B–70B models on five low-resource languages: Lua, Julia, R, OCaml, and Fortran.
An Efficient, Provably Optimal Algorithm for the 0-1 Loss Linear Classification Problem: This paper proposes the Incremental Cell Enumeration (ICE) algorithm — the first standalone algorithm with rigorous correctness proofs — capable of exactly solving the global optimum of the 0-1 loss linear classification problem in $O(N^{D+1})$ time, with extensions to polynomial hypersurface classification.
An Information-Theoretic Framework For Optimizing Experimental Design To Distinguish Probabilistic Neural Codes: This paper proposes the information gap, an information-theoretic measure that quantifies the ability of a given experimental design to distinguish between likelihood coding and posterior coding hypotheses. By deriving closed-form expressions for the cross-entropy performance difference between decoders under each hypothesis—shown to be the KL divergence between the true posterior and a task-marginalized surrogate posterior—the framework enables theory-driven optimal experimental design by maximizing this measure over the stimulus prior distribution.
ANO: Faster is Better in Noisy Landscapes: This paper proposes the Ano optimizer, which decouples the update direction from its magnitude — the direction is determined by the sign of the momentum for noise robustness, while the magnitude is determined by the instantaneous gradient absolute value (rather than the momentum magnitude) for responsiveness. Combined with an improved Yogi-style variance estimator, Ano significantly outperforms Adam/Lion/Adan in noisy and non-stationary environments (e.g., RL), while remaining competitive on standard tasks.
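The direction/magnitude decoupling in the Ano entry above can be illustrated with a stripped-down update rule. This sketch keeps only the two ingredients named in the note (sign of the momentum, absolute value of the instantaneous gradient) and deliberately omits the Yogi-style variance estimator; `ano_like_step` and its hyperparameters are hypothetical names chosen for this example.

```python
def ano_like_step(x, m, grad, lr=0.1, beta=0.9):
    """One simplified step: direction from sign(momentum), size from |grad|."""
    m = beta * m + (1.0 - beta) * grad   # exponential moving average of gradients
    direction = (m > 0) - (m < 0)        # sign of the momentum: -1, 0, or +1
    x = x - lr * direction * abs(grad)   # magnitude from the current gradient
    return x, m

# Minimise f(x) = x^2, whose gradient is 2x, starting from x = 5.
x, m = 5.0, 0.0
for _ in range(100):
    x, m = ano_like_step(x, m, 2.0 * x)
```

Because the step size follows the current gradient rather than the smoothed momentum, the update shrinks as soon as the gradient does, while the sign of the momentum filters out sign flips caused by noise.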
AnyUp: Universal Feature Upsampling: AnyUp proposes the first encoder-agnostic learnable feature upsampling method. Through feature-agnostic convolutional layers and window attention mechanisms, it requires only a single training pass to perform high-quality upsampling of arbitrary visual features across arbitrary resolutions, achieving state-of-the-art performance on semantic segmentation, depth estimation, and related tasks.
Articulation in Motion: Prior-Free Part Mobility Analysis for Articulated Objects: This paper proposes AiM (Articulation in Motion), a framework that reconstructs articulated objects from interaction videos and initial-state scans without requiring prior knowledge of the number of parts. It achieves dynamic-static decoupling via a dual-Gaussian representation (Static GS + Deformable GS), combines sequential RANSAC for prior-free part segmentation and joint estimation, and incorporates an SDMD module to handle newly exposed static regions. On complex 6-part objects (Storage), AiM achieves 79.34% mean IoU, substantially outperforming the prior-dependent ArtGS (52.23%).
Bayesian Influence Functions for Hessian-Free Data Attribution: This paper proposes the Local Bayesian Influence Function (BIF), which replaces the intractable Hessian inversion in classical influence functions with a covariance estimate obtained via SGLD sampling, enabling architecture-agnostic data attribution for models with billions of parameters and achieving state-of-the-art performance on retraining experiments.
Beyond Linearity in Attention Projections: The Case for Nonlinear Queries: Motivated by the theoretical finding of algebraic redundancy in $W_Q$, this work replaces the linear Query projection with a nonlinear residual form $Q(X)=(X+f_\theta(X))/2$. With its parameter count unchanged, the model outperforms a baseline that has 12.5% more parameters.
CaDrift: A Time-dependent Causal Generator of Drifting Data Streams: This paper proposes CaDrift, a time-dependent synthetic data stream generation framework based on structural causal models (SCMs). It introduces temporal correlation via EWMA smoothing and autoregressive noise, and realizes controllable distributional drift, covariate drift, severe drift, and local drift by modifying causal mapping functions. CaDrift fills the gap left by existing data stream generators that lack both causal structure and temporal dependence.
cadrille: Multi-modal CAD Reconstruction with Reinforcement Learning: cadrille is the first multi-modal CAD reconstruction model capable of handling point cloud, multi-view image, and text inputs simultaneously. Through a three-stage training paradigm of VLM backbone + SFT + RL fine-tuning, it achieves state-of-the-art performance across 10 CAD reconstruction benchmarks, with RL fine-tuning reducing the invalid rate to near 0%.
Characterizing and Optimizing the Spatial Kernel of Multi Resolution Hash Encodings: This paper analyzes Instant-NGP's multi-resolution hash encoding (MHE) through the lens of physical systems, deriving a closed-form approximation of its point spread function (PSF). The analysis reveals that the effective resolution is governed by the geometric mean resolution $N_{\text{avg}}$ rather than the finest resolution $N_{\max}$, and that axis-aligned grids introduce spatial anisotropy. The paper further proposes Rotated MHE (R-MHE), a zero-overhead method that eliminates anisotropy by applying a distinct rotation to the input coordinates at each hash level.
Chart Deep Research in LVLMs via Parallel Relative Policy Optimization: This paper proposes PRPO (Parallel Relative Policy Optimization), which addresses GRPO's training bottlenecks under multi-dimensional reward interference and heterogeneous data gradient conflicts through two-level parallel decoupled optimization — across reward dimensions and data types. It also introduces MCDR-Bench, which leverages an "error uniqueness principle" to transform subjective generation evaluation into objective error identification, enabling quantitative assessment of chart deep research capabilities.
CHLU: The Causal Hamiltonian Learning Unit as a Symplectic Primitive for Deep Learning: CHLU is a computational learning primitive grounded in relativistic Hamiltonian mechanics and symplectic integration. By enforcing phase-space volume conservation and introducing a causal velocity upper bound, it addresses gradient explosion/vanishing in LSTMs and information dissipation in Neural ODEs, achieving infinite-horizon stability and thermodynamic generative capability.
Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevance Assessment: This paper proposes DREAM, a multi-agent, multi-round debate framework with opposing-stance initialization for IR relevance annotation: cases with consensus are automatically labeled, while disagreements are escalated to human annotators (aided by debate history). DREAM achieves 95.2% balanced accuracy with only 3.5% human escalation. Based on this framework, the BRIDGE benchmark is constructed, uncovering 29,824 relevant annotations missing from existing benchmarks (428% of the original annotations) and correcting ranking bias in retrieval systems as well as retrieval-generation performance misalignment in RAG evaluation.
Compositional Diffusion with Guided Search for Long-Horizon Planning: This paper proposes CDGS (Compositional Diffusion with Guided Search), which embeds a population-based search mechanism—iterative resampling combined with likelihood-based pruning—into the diffusion denoising process to address the mode averaging problem arising from the composition of multimodal local distributions. CDGS enables sampling of globally consistent long-horizon plans from short-horizon models without long-horizon training data.
Condition Matters in Full-head 3D GANs: This paper identifies that view conditioning in full-head 3D GANs introduces severe directional bias—generation quality is substantially higher at the conditioned viewpoint than at others. To address this, the authors propose replacing view conditioning with view-invariant semantic features (frontal CLIP features) and introduce BalanceHead360, a synthetic dataset of 11.2 million 360° full-head images generated via Flux.1 Kontext, achieving for the first time high-fidelity, diverse full-head generation with consistent quality across all viewpoints.
Consistent Low-Rank Approximation: This paper formalizes and systematically studies the consistent low-rank approximation problem—maintaining a near-optimal rank-$k$ approximation of a matrix whose rows arrive in a stream while minimizing the total variation (recourse) of the solution. It proves that $O(k/\varepsilon \cdot \log(nd))$ recourse is achievable under additive error, $k^{3/2}/\varepsilon^2 \cdot \text{polylog}$ recourse is achievable under multiplicative $(1+\varepsilon)$ error, and establishes a lower bound of $\Omega(k/\varepsilon \cdot \log(n/k))$.
Directional Sheaf Hypergraph Networks: Unifying Learning on Directed and Undirected Hypergraphs: This paper proposes Directional Sheaf Hypergraph Networks (DSHN), which combines Cellular Sheaf theory with the directional information of directed hypergraphs to construct a complex-valued Hermitian Laplacian operator. The proposed operator unifies and generalizes existing graph and hypergraph Laplacians, achieving 2%–20% relative accuracy improvements over baselines on 7 real-world datasets.
Distributed Algorithms for Euclidean Clustering: This paper constructs $(1+\varepsilon)$-coresets for Euclidean $(k,z)$-clustering in the distributed setting, achieving communication complexity that matches known lower bounds (up to polylogarithmic factors) in both the coordinator model and the blackboard model.
Distributionally Robust Classification for Multi-Source Unsupervised Domain Adaptation: This paper proposes a distributionally robust learning framework that jointly models uncertainty over both the target-domain covariate distribution and the conditional label distribution, achieving significant generalization improvements in UDA settings where target data is extremely scarce or spurious correlations exist in the source domain.
DA-AC: Distributions as Actions — A Unified RL Framework for Diverse Action Spaces: DA-AC proposes treating the parameters of an action distribution (e.g., softmax probabilities or Gaussian mean/variance) as the agent's output "actions," relocating the action sampling process to the environment side. This enables a unified deterministic policy gradient framework for discrete, continuous, and hybrid action spaces. The approach is theoretically proven to achieve strictly lower variance than LR and RP estimators, and attains competitive or state-of-the-art performance across 40+ environments.
Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search: This paper proposes AIGB-Pearl, which introduces an offline trajectory evaluator and a KL-Lipschitz constrained score maximization scheme for generative auto-bidding. The framework enables generative models to safely surpass the performance ceiling imposed by static offline data under theoretical guarantees, achieving a 3% GMV improvement on Taobao's real-world advertising system.
Entropic Confinement and Mode Connectivity in Overparameterized Neural Networks: This paper reveals that systematic growth of curvature along low-loss paths generates entropic barriers, such that even when the energy path is flat, SGD noise confines optimization dynamics to flat regions near minima—resolving the paradox of "mode-connected but dynamically isolated" solutions.
Evaluating GFlowNet from Partial Episodes for Stable and Flexible Policy-Based Training: This paper establishes a theoretical connection between state flow functions and policy value functions in GFlowNet, proposes the Subtrajectory Evaluation Balance (Sub-EB) objective for reliable value function learning, and enhances the stability and flexibility of policy-based GFlowNet training.
Exchangeability of GNN Representations with Applications to Graph Retrieval: This paper identifies that trained GNN node embeddings are exchangeable random variables along the feature dimension (i.e., $p(\mathbf{X}) = p(\mathbf{X}\pi)$ holds for any dimensional permutation $\pi$), and exploits this property to approximate transportation-distance-based (EMD/Wasserstein) graph similarity as Euclidean distance via dimension-wise sorting. A unified locality-sensitive hashing (LSH) framework, GraphHash, is constructed upon this foundation, consistently outperforming baselines including FourierHashNet, DiskANN, IVF, CORGII, and SWWL in AUC on subgraph matching and graph edit distance (GED) retrieval tasks, scaling to corpus sizes of one million graphs.
Fast and Stable Riemannian Metrics on SPD Manifolds via Cholesky Product Geometry: This paper reveals a simple product structure on the Cholesky manifold and, building upon it, proposes two fast and numerically stable SPD metrics (PCM and BWCM) with closed-form expressions for all Riemannian operators, achieving simultaneous improvements in accuracy, efficiency, and numerical stability for SPD deep learning.
FastLSQ: Solving PDEs in One Shot via Fourier Features with Exact Analytical Derivatives: By exploiting the cyclic closed-form derivative structure of sinusoidal basis functions, this work presents a one-shot PDE solver that requires neither automatic differentiation nor iterative training. It achieves $10^{-7}$ accuracy in 0.07s for linear PDEs and $10^{-8}$–$10^{-9}$ accuracy in under 9s for nonlinear PDEs, outperforming PINNs by thousands of times in speed and several orders of magnitude in accuracy.
Federated ADMM from Bayesian Duality: This paper derives a Bayesian dual structure for ADMM from a variational Bayes (VB) perspective, proving that classical ADMM is a special case of VB over isotropic Gaussian families. Two novel extensions are introduced: a Newton-like variant (one-round convergence on quadratic objectives) and an Adam-like variant (IVON-ADMM, achieving +7% accuracy in heterogeneous deep learning settings).
FIRE: Frobenius-Isometry Reinitialization for Balancing the Stability-Plasticity Tradeoff: This paper formalizes the stability-plasticity tradeoff in continual learning as a constrained optimization problem—minimizing weight deviation (stability) subject to an orthogonality constraint (plasticity)—yielding a closed-form solution to the orthogonal Procrustes problem, $\tilde{W}^* = W(W^\top W)^{-1/2}$ (polar decomposition), implemented efficiently via Newton-Schulz iteration (<1% additional time). FIRE comprehensively outperforms baselines such as S&P across visual continual learning, LLM continual pre-training, and RL.
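The closed-form solution $W(W^\top W)^{-1/2}$ in the FIRE entry above is the orthogonal polar factor of $W$, and a Newton-Schulz iteration can approximate it using only matrix products. The sketch below is a minimal pure-Python version for a small square matrix; the helper names, the Frobenius-norm scaling, and the iteration count are assumptions of this example rather than details from the paper.

```python
def matmul(a, b):
    """Plain matrix product for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def polar_factor(w, iters=30):
    """Approximate W (W^T W)^(-1/2) via X <- 1.5 X - 0.5 X X^T X."""
    # Scale by the Frobenius norm so every singular value lies in (0, 1].
    fro = sum(v * v for row in w for v in row) ** 0.5
    x = [[v / fro for v in row] for row in w]
    for _ in range(iters):
        xxtx = matmul(x, matmul(transpose(x), x))
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxtx)]
    return x

w = [[2.0, 0.5, 0.0],
     [0.1, 1.5, 0.3],
     [0.0, 0.2, 1.0]]
q = polar_factor(w)
gram = matmul(transpose(q), q)   # should be close to the identity
p = matmul(transpose(q), w)      # should be close to a symmetric matrix
```

Each iteration pushes every singular value of the iterate toward 1 while leaving the singular vectors untouched, so the result keeps the "directions" of the trained weights and resets only their scales.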
From Movement to Cognitive Maps: RNNs Reveal How Locomotor Development Shapes Hippocampal Spatial Coding: By combining cluster analysis of infant rodent locomotor development with a shallow RNN predictive learning model, this work provides the first computational demonstration that developmental changes in movement statistics (crawling → walking → running → adult) drive the sequential emergence of spatially tuned hippocampal neurons (place cells, head direction cells, and conjunctive coding cells). The model quantitatively reproduces the developmental timeline observed in rat hippocampal recordings and predicts a progressive increase in conjunctive place-HD coding cells during development — a prediction subsequently validated in experimental data.
Harpoon: Generalised Manifold Guidance for Conditional Tabular Diffusion: This paper extends manifold theory from image to tabular diffusion models, proving that the gradient of any differentiable inference-time loss lies in the tangent space of the data manifold (beyond the square-error loss restriction). Based on this result, the proposed Harpoon method guides unconditional samples at inference time along the manifold to satisfy diverse tabular constraints.
HEEGNet: Hyperbolic Embeddings for EEG: This work presents the first systematic empirical validation that EEG data exhibits hyperbolicity (hierarchical structure), and proposes HEEGNet, a hybrid hyperbolic network architecture. The model combines a Euclidean encoder for spatiotemporal-spectral feature extraction with a hyperbolic encoder for capturing hierarchical relationships, augmented by a novel coarse-to-fine domain adaptation strategy (DSMDBN). HEEGNet achieves state-of-the-art performance across multiple cross-domain tasks spanning visual evoked potentials, emotion recognition, and intracranial EEG.
Hilbert-Guided Sparse Local Attention: By reordering 2D image tokens into a 1D sequence via Hilbert space-filling curves—which preserve spatial locality—this work substantially increases the empty-block ratio in local attention (from 87.5% to 96.9%), enabling 4× speedup for window attention and 18× for sliding-window attention via FlexAttention, with negligible accuracy loss.
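To see why Hilbert reordering preserves spatial locality, the sketch below maps 2D token coordinates to their position along a Hilbert curve using the standard iterative conversion, then compares it with raster order. The grid size and the helper name `hilbert_index` are choices made for this example; the attention kernels and block-sparsity figures from the note are not reproduced here.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve on an n x n grid.

    n must be a power of two.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

n = 8
raster = [(x, y) for y in range(n) for x in range(n)]        # row-major order
hilbert = sorted(raster, key=lambda c: hilbert_index(n, *c))  # Hilbert order

raster_steps = [manhattan(a, b) for a, b in zip(raster, raster[1:])]
hilbert_steps = [manhattan(a, b) for a, b in zip(hilbert, hilbert[1:])]
```

In Hilbert order every pair of consecutive tokens is a pair of neighbouring pixels, whereas raster order jumps across the whole image at the end of each row. Tokens that are close in the 1D sequence are therefore also close in 2D, which is what lets local attention windows fall into fewer non-empty blocks.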
Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime: This paper provides the first proof that mini-batch Adam exhibits a different implicit bias from its full-batch counterpart: a constructed dataset causes per-sample Adam to converge to an $\ell_2$ maximum-margin classifier (whereas full-batch Adam converges to $\ell_\infty$), and a proxy algorithm, AdamProxy, is introduced to characterize data-adaptive Mahalanobis-norm margin maximization on general datasets.
In-Context Algebra: This paper introduces an in-context algebra task—where tokens serve as pure variables and each sequence randomly reassigns their meanings—and finds that Transformers in this setting no longer learn classical Fourier/geometric representations. Instead, three symbolic reasoning mechanisms emerge (commutative copying, identity element recognition, and closure-based cancellation), with these capabilities appearing sequentially as phase transitions during training.
Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch RL: This paper proposes the Jackpot framework, which applies Optimal Budgeted Rejection Sampling (OBRS) to accept or reject rollout tokens at the token level within a controllable acceptance budget, and reweights the remaining samples. The method is theoretically proven to strictly reduce the KL divergence between the actor and policy under any budget. Combined with joint training and distillation of the rollout model, Jackpot enables a small model (e.g., Qwen3-1.7B) to serve as the rollout model for training a large model (e.g., Qwen3-8B), achieving performance close to the on-policy baseline.
Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, and Value Weight Triplet in Self-Attention: This paper theoretically demonstrates that the Query/Key/Value weight triplet in Transformer self-attention is redundant — the Query weight matrix can be replaced by an identity matrix (reducing attention parameters by 25%). GPT-style models trained from scratch confirm that performance is preserved under appropriate hyperparameter adjustment, and training remains stable at 3× lower weight decay, suggesting implicit regularization.
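As a rough illustration of the idea, the sketch below implements single-head attention in which the query projection is the identity, so the queries are the input rows themselves. It is a toy under stated assumptions, not the paper's implementation: the function names are hypothetical, the key projection is assumed square so dimensions match, and the 25% figure is reproduced by assuming four equally sized attention matrices (query, key, value, output).

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_identity_query(x, w_k, w_v):
    """Single-head attention with W_Q fixed to the identity (Q = x)."""
    d = len(x[0])
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(x, k_t)  # queries are the raw inputs
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


def attention_param_count(d_model, with_query=True):
    """Parameter count assuming d_model x d_model Q, K, V, and output matrices."""
    return (4 if with_query else 3) * d_model ** 2


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_k = [[0.5, -0.5], [0.25, 1.0]]
w_v = [[1.0, 0.0], [0.0, 2.0]]
out, weights = attention_identity_query(x, w_k, w_v)
saving = 1 - attention_param_count(512, False) / attention_param_count(512, True)
print(len(out), len(out[0]), saving)
```

Each row of `weights` still sums to one, so the layer remains a valid attention mechanism; only the learned query projection is gone.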
Latent Equivariant Operators for Robust Object Recognition: Promises and Challenges: This paper proposes learning or predefining equivariant shift operators in latent space to handle group transformations such as rotation and translation. At inference time, transformation parameters are estimated via KNN search, and inputs are mapped back to a canonical pose before classification. Experiments on MNIST demonstrate successful extrapolation to out-of-training-range transformations, offering greater flexibility than standard networks and equivariant networks, though scaling to more complex datasets remains an open challenge.
Latent Fourier Transform: This paper proposes LatentFT, a framework that applies the Discrete Fourier Transform (DFT) to latent time-series representations produced by a diffusion autoencoder, decomposing musical patterns by timescale. During training, a correlated log-scale frequency mask is randomly applied so that the decoder learns to reconstruct audio from partial spectra. At inference time, users specify frequency masks to selectively preserve or blend musical elements across different timescales. LatentFT consistently outperforms baselines including ILVR, Guidance, Codec Filtering, and RAVE on conditional generation and music blending tasks, with its superior audio quality and blending capability statistically confirmed by a listening test involving 29 musicians.
LPWM: Latent Particle World Models for Object-Centric Stochastic Dynamics: LPWM is the first self-supervised object-centric world model that scales to real-world multi-object datasets. Its core innovation is learning an independent latent action distribution ($z_c^m$) for each particle while encoding all frames in parallel via a causal spatiotemporal Transformer. It supports diverse conditioning signals (actions, language, image goals, multi-view), achieves state-of-the-art video prediction, and demonstrates imitation learning capability (89% success rate on OGBench task3).
Learning Adaptive Distribution Alignment with Neural Characteristic Function for Graph Domain Adaptation: ADAlign is proposed as a framework that leverages neural characteristic functions to adaptively align source/target graph distributions in the spectral domain — eliminating the need for manual selection of alignment criteria by automatically identifying the most prominent distributional discrepancies in each transfer scenario. It achieves state-of-the-art performance across 16 transfer tasks on 10 datasets while reducing memory consumption and training time.
Learning on a Razor's Edge: Identifiability and Singularity of Polynomial Neural Networks: Using tools from algebraic geometry, this paper systematically analyzes MLPs and CNNs with polynomial activations: it proves finite identifiability for MLPs and unique identifiability for CNNs, reveals that sparse subnetworks correspond to singular points of the neuromanifold, and provides a geometric explanation of the sparsity bias in MLPs via the notion of "critical exposure"—a property that CNNs do not possess.
Learning Structure-Semantic Evolution Trajectories for Graph Domain Adaptation: This paper proposes DiffGDA—the first method to introduce diffusion models into graph domain adaptation (GDA). It formulates the continuous-time joint structure-semantic evolution from source graphs to target graphs using stochastic differential equations (SDEs), and employs a density-ratio-based domain-aware guidance network to steer the diffusion trajectory toward the target domain. Theoretical convergence to the optimal adaptation path is proven, and DiffGDA comprehensively outperforms state-of-the-art methods across 14 transfer tasks on 8 real-world datasets.
LipNeXt: Scaling up Lipschitz-based Certified Robustness to Billion-parameter Models: This paper proposes LipNeXt—the first unconstrained, convolution-free 1-Lipschitz architecture—which learns orthogonal matrices via manifold optimization and achieves spatial mixing through a theoretically motivated Spatial Shift Module derived from Theorem 1. LipNeXt scales to billion-parameter models and establishes new state-of-the-art certified robust accuracy (CRA) on CIFAR-10/100, Tiny-ImageNet, and ImageNet, with a +8% CRA gain on ImageNet at $\varepsilon=1$.
Lipschitz Bandits with Stochastic Delayed Feedback: This paper provides the first systematic study of Lipschitz bandits over continuous arm spaces under stochastic delayed feedback. For bounded delays, it proposes the Delayed Zooming algorithm, which employs a lazy update mechanism to maintain the suboptimality gap bound $\Delta(x) \leq 6r_t(x)$. For unbounded delays, it proposes DLPP, a phased pruning strategy whose regret is tied to the delay quantile $Q(p)$. Instance-dependent lower bounds are established to prove that DLPP is nearly optimal.
Missing Mass for Differentially Private Domain Discovery: This paper revisits the differentially private domain discovery problem through the lens of missing mass, providing the first near-optimal $\ell_1$ missing mass upper bounds for the simple and scalable Weighted Gaussian Mechanism (WGM) on Zipfian data, as well as distribution-free $\ell_\infty$ missing mass guarantees. WGM is further applied as a domain discovery preprocessing step for private top-$k$ and $k$-hitting set problems over unknown domains, with theoretical results validated on six real-world datasets.
Neural Force Field: Few-shot Learning of Generalized Physical Reasoning: This paper proposes Neural Force Field (NFF), which models object interactions as continuous force fields. A neural operator learns the force field function, and an ODE integrator decodes trajectories from it. NFF achieves few-shot state-of-the-art on three benchmarks—I-PHYRE (100 trajectories), N-body (200 trajectories), and PHYRE (0.012M samples, 267× fewer than prior SOTA)—reducing cross-scenario RMSE by 32–64% and achieving near-human performance on planning tasks.
Neuro-Symbolic Decoding of Neural Activity: This paper proposes NEURONA, a neuro-symbolic framework for fMRI decoding and concept grounding. By decomposing visual scenes into symbolic programs (logical combinations of concepts), NEURONA substantially outperforms both end-to-end neural decoders and linear models on fMRI question-answering tasks.
Noisy-Pair Robust Representation Alignment for Positive-Unlabeled Learning: This paper proposes NcPU, a non-contrastive PU learning framework that applies a sqrt transformation to the standard non-contrastive loss (NoiSNCL) so that clean-pair gradients dominate training, and introduces PhantomGate to provide conservative negative supervision with a regret rollback mechanism. Both modules iterate in a mutually beneficial manner under an EM framework. Without relying on auxiliary negative samples or pre-estimated class priors, NcPU narrows the gap with supervised learning from 14.26% to <1.4% on CIFAR-100, and achieves SOTA on xBD disaster damage assessment as well.
On the Impact of the Utility in Semivalue-based Data Valuation: This paper introduces a geometric representation termed spatial signature to unify the modeling of utility selection in data valuation as a directional rotation problem on the unit circle. It further proposes a robustness metric $R_p$ and demonstrates that the Banzhaf value exhibits the highest ranking stability across different utility functions.
On the Lipschitz Continuity of Set Aggregation Functions and Neural Networks for Sets: This paper systematically investigates the Lipschitz continuity of three commonly used set aggregation functions (sum, mean, max) and attention mechanisms under three multiset distance functions, derives upper bounds on the Lipschitz constants of set neural networks, and connects these results to perturbation stability and generalization under distribution shift.
Out of the Shadows: Exploring a Latent Space for Neural Network Verification: By interpreting zonotopes as "shadows" (projections) of high-dimensional hypercubes, this paper identifies that the input set and output enclosure share a common latent space. Building on this insight, it proposes a specification-driven input refinement method that back-propagates unsafe output constraints into the input space to prune subproblems, reducing branch-and-bound subproblem counts by 60–65%. All operations are matrix-based, enabling efficient GPU acceleration. The method achieves competitive performance with top-tier tools such as α-β-CROWN across eight VNN-COMP'24 benchmarks.
Oversmoothing, Oversquashing, Heterophily, Long-Range, and More: Demystifying Common Beliefs in Graph Machine Learning: This paper systematically examines nine common beliefs in graph machine learning concerning oversmoothing, oversquashing, homophily/heterophily, and long-range dependencies. Through concise counterexamples, each belief is refuted. Notably, "oversquashing" is decomposed into two independent concepts—computational bottleneck and topological bottleneck—thereby clarifying widespread conceptual confusion in the field.
OwlEye: Zero-Shot Learner for Cross-Domain Graph Data Anomaly Detection: This paper proposes OwlEye, a framework that aligns heterogeneous graph embeddings into a shared space via pairwise-distance-statistics-based cross-domain feature alignment, extracts attribute-level and structure-level normal patterns from multiple graphs into an extensible dictionary, and detects anomalous nodes in unseen graphs under fully zero-shot conditions through a truncated attention-based reconstruction mechanism. OwlEye achieves an average AUPRC of 36.17% across 8 datasets, surpassing the strongest baseline ARC by approximately 5.4 percentage points.
Predicting Kernel Regression Learning Curves from Only Raw Data Statistics: This paper proposes the Hermite Eigenstructure Ansatz (HEA), which analytically predicts the learning curves (test error vs. sample size) of rotation-invariant kernels on real image datasets (CIFAR-5m, SVHN, ImageNet) using only two statistics: the data covariance matrix and the Hermite decomposition of the target function. The paper proves that HEA holds for Gaussian data and empirically demonstrates that MLPs in the feature-learning regime also learn Hermite polynomials in the order predicted by HEA.
Probabilistic Kernel Function for Fast Angle Testing: This paper studies the angle testing problem in high-dimensional Euclidean space and proposes two deterministic probabilistic kernel functions, $K_S^1$ and $K_S^2$, based on reference angles for angle comparison and angle threshold judgment, respectively. Theoretical guarantees are obtained without relying on asymptotic assumptions of Gaussian distributions. Applied to approximate nearest neighbor search (ANNS), the method achieves 2.5×–3× QPS speedup on HNSW graphs.
Refine Now, Query Fast: A Decoupled Refinement Paradigm for Implicit Neural Fields: This paper proposes the Decoupled Representation Refinement (DRR) paradigm, which employs a deep refiner network to offline-refine the embedding structure and cache the results, so that the inference stage requires only fast interpolation and a lightweight decoder. On ensemble simulation surrogate modeling tasks, DRR-Net achieves state-of-the-art reconstruction accuracy at less than 1/27 of the inference cost.
Revisiting Sharpness-Aware Minimization: A More Faithful and Effective Implementation: This paper proposes a new intuitive interpretation of SAM's underlying mechanism — that the gradient at the perturbed point approximates the direction toward the local maximum — and reveals its imprecision as well as the multi-step degradation problem. It then introduces XSAM, which achieves more faithful and effective sharpness-aware minimization by explicitly estimating the direction of the maximum.
Scalable Random Wavelet Features: Efficient Non-Stationary Kernel Approximation with Convergence Guarantees: This paper proposes Random Wavelet Features (RWF), a scalable non-stationary kernel approximation framework constructed by randomly sampling from a family of wavelets. RWF preserves the linear-time complexity of random feature methods while offering guarantees of positive definiteness, unbiasedness, and uniform convergence.
SEED: Towards More Accurate Semantic Evaluation for Visual Brain Decoding: This paper proposes SEED (Semantic Evaluation for Visual Brain Decoding), a composite evaluation metric combining three complementary measures — Object F1, Cap-Sim, and EffNet — which substantially outperforms all existing metrics in alignment with human evaluation.
Speculative Actions: A Lossless Framework for Faster AI Agents: Inspired by CPU speculative execution and LLM speculative decoding, this paper proposes the Speculative Actions framework: while a slow Actor (large model) computes, a fast Speculator (small model) predicts future actions and pre-executes them; upon a match, the waiting round is skipped, achieving lossless acceleration. The framework achieves 15–30% latency reduction across Chess, e-commerce, and QA scenarios. A confidence-based dynamic branching strategy attains acceleration comparable to three speculative branches while using 40% fewer tokens.
t-SNE Exaggerates Clusters, Provably: This paper provides rigorous theoretical proofs of two fundamental failure modes of t-SNE: (1) the strength of input clusters cannot be inferred from the output, and (2) extreme outliers cannot be faithfully represented — even when the input has no cluster structure or contains extreme outliers, t-SNE may produce perfectly clustered visualizations.
The Counting Power of Transformers: This paper proves that Transformers can express not only (semi-)linear counting properties but all semi-algebraic counting properties (i.e., Boolean combinations of multivariate polynomial inequalities), generalizing all prior results on the counting power of Transformers and deriving novel undecidability conclusions.
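For readers unfamiliar with the terminology, a semi-algebraic counting property is a yes/no question about a sequence that depends only on how often each symbol occurs, expressed as a Boolean combination of polynomial inequalities over those counts. The sketch below shows one such property chosen purely for illustration; the specific polynomials and the function name are assumptions, and the code says nothing about how a Transformer would compute it.

```python
from collections import Counter


def example_property(word):
    """True iff (#a * #b > #c ** 2) or (#a + #b <= #c).

    The first inequality is non-linear in the counts, so the property lies
    outside the (semi-)linear class but inside the semi-algebraic one.
    """
    counts = Counter(word)
    a, b, c = counts["a"], counts["b"], counts["c"]
    return (a * b > c ** 2) or (a + b <= c)


print(example_property("aabb"))  # 2 * 2 > 0, so True
print(example_property("abc"))   # 1 > 1 fails and 2 <= 1 fails, so False
print(example_property("abcc"))  # 1 > 4 fails but 2 <= 2 holds, so True
```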
The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?: This paper decomposes AI model errors into bias (systematic misalignment) and variance (incoherent behavior), finding that longer reasoning leads to greater incoherence, and that larger models become more incoherent on difficult tasks. This suggests that future superintelligent AI is more likely to exhibit unpredictable, "industrial accident"-style failures than to coherently pursue wrong objectives.
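The bias/variance split mentioned above can be illustrated with the classical decomposition of mean squared error over repeated runs on one task. The sketch below is a generic statistics example and not necessarily the paper's exact metric: the scalar outputs, the target value, and the `variance_share` summary are all hypothetical choices made for illustration.

```python
def decompose_error(outputs, target):
    """Split mean squared error over repeated runs into bias^2 and variance."""
    n = len(outputs)
    mean = sum(outputs) / n
    bias_sq = (mean - target) ** 2                      # systematic offset
    variance = sum((o - mean) ** 2 for o in outputs) / n  # run-to-run spread
    mse = sum((o - target) ** 2 for o in outputs) / n
    return bias_sq, variance, mse


# Hypothetical scalar answers from four independent runs on the same task.
outputs = [1.0, 3.0, 2.0, 2.0]
bias_sq, variance, mse = decompose_error(outputs, target=1.0)

# One possible summary (an assumption here): the fraction of error that is
# attributable to inconsistency across runs rather than to a systematic offset.
variance_share = variance / mse
print(bias_sq, variance, mse)  # prints: 1.0 0.5 1.5
```

With population variance, `bias_sq + variance` equals `mse` exactly, so the two terms account for all of the error.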
The Invisibility Hypothesis: Promises of AGI and the Future of the Global South: This paper introduces the Invisibility Hypothesis, arguing that as AI systems increasingly serve as the coordination layer for economic and political allocation, they will systematically favor "machine-readable" individuals. Informal workers in the Global South, lacking digital verifiability, face managed exclusion. The central risk shifts from job displacement to relevance loss, and this exclusion is self-reinforcing.
The Price of Robustness: Stable Classifiers Need Overparameterization: This paper establishes stability-generalization bounds for discontinuous classifiers and proves a "law of robustness" for classification: any interpolating classifier with $p \approx n$ parameters is necessarily unstable, and achieving high stability requires overparameterization on the order of $p \approx nd$.
ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization: ToProVAR is a framework that uses attention entropy as a single, unified measure of sparsity across three dimensions of VAR models (token, layer, and scale), achieving up to 3.4× speedup with negligible image quality degradation and significantly outperforming FastVAR and SkipVAR.
Towards Sustainable Investment Policies Informed by Opponent Shaping: This paper formally proves the conditions under which the InvestESG simulation environment constitutes a social dilemma, and applies the Advantage Alignment opponent shaping algorithm to guide economic agents toward sustainable investment equilibria.
Training Deep Normalization-Free Spiking Neural Networks with Lateral Inhibition: This paper proposes DeepEISNN, a normalization-free learning framework based on cortical excitatory-inhibitory (E-I) circuits. Through two techniques—E-I Init and E-I Prop—it achieves stable end-to-end training of deep SNNs while balancing performance and biological plausibility.
When to Retrain after Drift: A Data-Only Test of Post-Drift Data Size Sufficiency: CALIPER is a detector- and model-agnostic, data-only test that estimates the minimum amount of post-drift data required for safe retraining after abrupt concept drift. It tracks the monotonic non-increasing trend of a surrogate error from weighted local regression (WLR) as the locality parameter $\theta$ increases, combined with an effective sample size (ESS) gate, without requiring any actual retraining of the downstream model.
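An effective sample size gate of the kind mentioned above can be sketched with Kish's standard formula, $\mathrm{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$. The snippet below is an illustration under assumptions, not the paper's procedure: the Gaussian kernel, the `bandwidth` parameter, and the `min_ess` threshold are hypothetical stand-ins, and whether they correspond to the paper's $\theta$ or gate is not established here.

```python
import math


def kish_ess(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


def gaussian_weights(distances, bandwidth):
    """Hypothetical local-regression weights from a Gaussian kernel."""
    return [math.exp(-(d * d) / (2.0 * bandwidth ** 2)) for d in distances]


def passes_ess_gate(weights, min_ess):
    """Accept a local fit only if enough samples effectively contribute."""
    return kish_ess(weights) >= min_ess


distances = [0.0, 1.0, 2.0, 3.0]
narrow = gaussian_weights(distances, bandwidth=0.5)
wide = gaussian_weights(distances, bandwidth=5.0)
print(kish_ess(narrow) < kish_ess(wide))  # prints: True
```

Equal weights give an ESS equal to the sample count, while a single dominant weight drives it toward one, which is why a gate on ESS guards against estimates that rest on too few points.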