🧪 ICML2026 Paper Notes
410 ICML2026 paper notes covering Multimodal VLM (30), Image Generation (22), Interpretability (21), Model Compression (21), LLM Reasoning (20), Reinforcement Learning (20), Scientific Computing (19), LLM Safety (18), and 40 other areas. Each note has a TL;DR, motivation, method, experiments, highlights, and limitations: a 5-minute read of the core ideas.
🧩 Multimodal VLM
- Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds
-
Through a pilot study, the authors find that "explicitly lifting vision to point clouds and then fusing with 2D patches" is the most effective way to inject 3D information into VLA. To address the scarcity of 3D data and domain gaps among different point cloud sources (simulation/sensor/monocular estimation), they propose Any3D-VLA, which uses hybrid point cloud training to learn source-agnostic geometric representations, achieving a 29.2-percentage-point improvement (62.5% vs. 33.3%) over the strongest baseline in real-world zero-shot grasping tasks.
- Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
-
This work requires VLM outputs to be split into `<recognition>` perception blocks and `<think>` reasoning blocks, then uses a "blindfolded" text-only reasoning agent (which cannot access the image, only the perception text written by the VLM) to determine if the question can be answered correctly, serving as the perception reward \(R_P\), paired with structured verbal verification (SVV) as the outcome reward \(R_O\). MoCA uses \(R_P\) as a gate for modality-level credit assignment, enabling a 7B model to achieve simultaneous improvements across 9 perception/reasoning/rich-modality benchmarks, surpassing GPT-4o on multiple metrics.
- Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners
-
Addressing the "understanding–generation gap" in unified multimodal models on anything-to-image (X2I) tasks (can understand but cannot generate), this paper proposes the Self-Adaptive Interleaved Reasoner: a hierarchical data synthesis pipeline routes 50,000 samples among direct generation, self-reflection, and multi-step planning modes; SFT + GRPO training is applied with step-wise reasoning rewards and intra-group complexity penalties, enabling Emu3.5 to surpass GPT-4o, Gemini 2.5 Flash, and other closed-source models on KRIS-Bench and OmniContext.
- Calibrated Multimodal Representation Learning with Missing Modalities
-
Addressing the practical scenario of "training unified multimodal alignment with partial modality data such as V-T, A-T," this work theoretically establishes upper and lower bounds for "anchor shift caused by missing modalities" via singular value perturbation, and proposes CalMRL: a probabilistic PCA-style generative model performs closed-form EM imputation for missing modalities at the representation level, then feeds both observed and imputed representations into the SVD alignment objective of GRAM/PMRL. On VAST, cross-modal average Recall@1 is improved from 44.8 to 54.2 (+9.4).
- DCER: Robust Multimodal Fusion via Dual-Stage Compression and Energy-Based Reconstruction
-
DCER unifies "intra-modal frequency domain compression + cross-modal bottleneck token" as a robust fusion pipeline, employs a learned energy function for gradient-based reconstruction of missing modalities, and uses the final energy value as intrinsic uncertainty, achieving new SOTA on MOSI/MOSEI/SIMS.
- FreeRet: MLLMs as Training-Free Retrievers
-
FreeRet proposes a fully training-free two-stage multimodal retrieval framework: Stage 1 bypasses the MLLM's final MLP and uses controlled generation prompts to extract semantically faithful embeddings for candidate retrieval; Stage 2 reformulates reranking as a multiple-choice question to avoid LLM framing bias. On MMEB, it outperforms retrieval models trained on tens of millions of paired data.
- FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision–Language Models
-
FRISM refines "VLM × LRM merging" from the layer level to the SVD subspace level: it uses the SVD subspaces of LRM task vectors as reasoning priors, then applies label-free self-distillation (training only learnable gates, with a KL term for vision preservation and spectral norm maximization for reasoning absorption) to find the optimal injection strength, significantly improving VL reasoning performance without notable vision degradation.
- Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs
-
This paper unifies quantization-aware training (QAT) and knowledge distillation (KD) from the Information Bottleneck (IB) perspective, proposing the GRACE framework (confidence-gated decoupled distillation + relational centered kernel alignment + adaptive IB controller). This enables INT4-quantized LLaVA / Qwen-VL models not only to avoid performance drops but to surpass BF16 baselines on multiple benchmarks, achieving 3× throughput and 54% memory savings in real-world deployment.
- Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
-
This work reframes the "regression to the mean" issue of MLLM continuous value regression under long-tail distributions as a distribution-aware RL problem. Within the GRPO framework, the Concordance Correlation Coefficient (CCC) is used as a batch-level reward—simultaneously considering correlation, variance, and mean—thus explicitly penalizing prediction distribution collapse. On four long-tail regression tasks and Qwen2.5-VL-3B/7B, it consistently outperforms SFT, SoftLabel, and various point-wise RL methods, with especially significant MAE reductions in medium/few-shot regions.
- Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models
-
This work finds that the intermediate-layer embeddings of instruction tokens in MLLMs naturally filter misleading information introduced from the visual side. Based on this, a training-free InsLen score (Calibrated Local Score + Context Consistency Score) is proposed, which improves object hallucination detection AUROC by up to 13.81% across 5 MLLMs × 4 benchmarks.
- Large Vision-Language Models Get Lost in Attention
-
This paper quantitatively diagnoses the residual stream of LVLMs using a geometric information-theoretic framework based on "information complexity (eRank) + subspace support." It finds that attention almost exclusively performs reconfiguration within the subspace, while the FFN injects new semantic dimensions. Even more strikingly, replacing learned attention weights with Gaussian noise leads to equal or improved performance on most vision tasks, revealing severe misalignment and redundancy in visual attention of contemporary LVLMs.
- Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training
-
VISTA transforms self-improvement training for multimodal large models into a two-stage pipeline: "augmenting hard samples via prefix resampling, filtering pseudo-positive samples via Visual Attention Score (VAS)." On Qwen2.5-VL-3B, it achieves an average improvement of +13.66% on mathematical/medical multimodal reasoning.
- Less Precise Can Be More Reliable: A Systematic Evaluation of Quantization's Impact on VLMs Beyond Accuracy
-
Through 700,000 experimental runs covering 16 quantization methods × 10 VLMs × multiple reliability metrics, this work finds that quantization is not merely a destructive process—it suppresses high-rank, low-variance spectral components, thereby improving calibration, OOD detection, and noise robustness, but also amplifies reliance on covariate shift and spurious correlations.
- LIMSSR: LLM-Driven Sequence-to-Score Reasoning under Training-Time Incomplete Multimodal Observations
-
The authors reformulate multimodal action quality assessment with "missing modalities during training" as an "LLM-based conditional sequence-to-score reasoning" problem. Prompts and special tokens guide the LLM to complete missing semantics without full data supervision. Combined with mask-aware dual-path fusion to suppress hallucination, the method outperforms SOTA models that rely on complete training data across three AQA datasets.
- Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models
-
Model-Dowser scores each parameter in an MLLM using the product of "weight magnitude × input activation × output Jacobian." High-scoring parameters are frozen, and only low-scoring ones are updated. This enables deep fine-tuning on LLaVA/NVILA to learn downstream tasks while retaining pretraining knowledge. Compared to SPIDER and ModelTailor, it consistently leads in H-score.
- Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives
-
To address the visual forgetting problem of MLLMs in cross-scenario VQA, this work constructs the MSVQA benchmark (four scenarios: high-altitude, underwater, low-altitude, indoor) and proposes the Unifier framework—adding CSR multi-branch modules and a projector (VRE) for parameter isolation within vision blocks, then aligning different branch representations with a KL-based soft constraint (VCC). With a single inference, Unifier improves VQA by 2.70–10.62% and F1 by 3.40–7.69% over 20-step continual learning.
- MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
-
MUSE attributes the "understanding-generation" zero-sum dilemma of unified visual tokenizers to manifold misalignment, proposing the gradient orthogonality hypothesis—injecting semantics into \(W_V\) while structural gradients flow through \(W_{Q,K}\). Through Synergistic Block + DINOv3 topological alignment + NCE semantic anchoring, the two are fully decoupled. As a result, gFID 3.08 and linear probing 85.2% (even surpassing the InternViT-300M teacher at 82.5%) coexist, achieving genuine "mutual reinforcement" rather than trade-off for the first time.
- OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
-
This paper identifies that existing Omni-LLM token compression methods treat audio and video "symmetrically," which is suboptimal. It proposes OmniSIFT—a two-stage, modality-asymmetric compression framework: first, spatio-temporal saliency prunes video redundancy to obtain "visual anchors," then these anchors guide audio token selection. With only 4.85M extra parameters, OmniSIFT consistently outperforms existing compression baselines and even the original model on Qwen2.5-Omni-7B when retaining 25% of tokens.
- Perceptual Flow Network for Visually Grounded Reasoning
-
Abandoning the traditional RLVR approach of "hard supervision with precise expert bounding boxes," PFlowNet models the perceptual behavior itself as a structured Perceptual Flow latent variable. It approximates the ideal reasoning-oriented posterior with a variational distribution \(p_\theta(Z|X)\), and is trained using Sub-TB variational RL, multi-dimensional rewards, and Vicinal Geometric Shaping. As a result, the 8B Qwen3-VL achieves a new SOTA of 90.6% on V* Bench and 67.0% on MME-RealWorld-lite.
- Referring Multiple Regions with Large Multimodal Models via Contextual Latent Steering
-
CSteer proposes a training-free latent steering method that constructs "contextual vectors" from the difference in hidden activations between incorrect/correct referring answers. These vectors are hierarchically injected into early query layers and mid-to-late decode layers during inference, enabling general LMMs (Qwen3-VL, InternVL-3.5) to outperform specialized region LMMs fine-tuned for multi-region visual referring tasks.
- Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models
-
This work redefines LVLM hallucination as "visual information loss suppressed by language priors." By orthogonally projecting out the language prior from the original visual direction to obtain a "pure visual vector," and using risk gating to sparsely intervene at only the optimal single deep layer, the method reduces the \(\text{CHAIR}_S\) hallucination rate by ~19% without training, while preserving MM-Vet general capability.
- ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning
-
This work systematically reveals that the widely used VSI-Bench suffers from structural failures due to 3D annotation drift and inconsistent frame sampling. The authors re-annotate 381 scenes and 5,365 objects, design frame-budget adaptive QA, and introduce a dummy video stress test by removing all frames containing the queried object, resulting in a high-fidelity spatial intelligence benchmark named ReVSI. Evaluation shows that open-source VLMs experience up to a 40% drop on ReVSI, and still exhibit high hallucination rates on dummy videos, exposing that current spatial reasoning abilities have been systematically overestimated by VSI-Bench.
- ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision
-
Addressing the prevalent use of "sparse grounding" annotation and loss of full-screen structure in GUI agents, this work introduces a fully automated Webshot pipeline to construct the dense screen parsing dataset ScreenParse, comprising 771K screenshots, 21M elements, and 55 classes. The authors train ScreenVLM, a model with only 316M parameters, to parse entire screens into ScreenTag structural sequences, outperforming 8B-scale foundation VLMs on both dense parsing and sparse grounding benchmarks while reducing latency to approximately \(1/4\).
- Seeing to Generalize: How Visual Data Corrects Binding Shortcuts
-
This paper reproduces the puzzling phenomenon that "VLMs outperform their base LLMs on pure text tasks" using a controlled synthetic "color-shape-item" retrieval task, and mechanistically explains it: visual training shifts the model's variable binding strategy from "positional shortcuts" to "semantic-symbolic matching." This shift is retained when switching back to pure text, boosting OOD retrieval accuracy from 37.2% to 69.5%. Consistent increases in the "symbolic/positional ratio" are also observed in real Qwen2/2.5/3 model families.
- Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models
-
This work leverages Pointwise Partial Information Decomposition to quantify vision-text modality interactions and proposes a Multimodal Interaction Gate: it automatically selects samples dominated by "image-unique information" and lets the VLM self-generate captions to inject into the text side, thereby converting unique visual signals into redundant shared signals. As a result, the VLM's visual hallucination under blurry or corrupted inputs drops by 38.3%, and consistency improves by 16.8%.
- SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
-
SLQ appends a small set of "shared latent queries" \(\mathbf{Q}\) to the end of image/text token sequences, leveraging the MLLM's own causal attention to aggregate global context. Training only a few thousand query parameters turns a frozen MLLM into a retriever that outperforms full fine-tuning and LoRA on COCO/Flickr30K. The paper also introduces KARR-Bench to evaluate "implicit knowledge reasoning" capabilities.
- The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design
-
This paper formalizes the issue of "VLMs failing to perceive details" as a Sequential Bayesian Optimal Experimental Design (S-BOED) problem, and proposes a training-free FOVEA module based on a computable proxy objective of "coverage × resolution." FOVEA consistently outperforms Direct and ReAct-style baselines on high-resolution and remote sensing benchmarks.
- Toward Structural Multimodal Representations: Specialization, Selection, and Sparsification via Mixture-of-Experts
-
This paper proposes the S3 framework, which uses MoE to decompose multimodal representations into concept-level experts (Specialization), activates relevant experts per task via routing (Selection), and prunes low-contribution paths at inference based on routing scores (Sparsification). On four MultiBench benchmarks, it reveals an "inverted-U" curve where performance peaks at intermediate sparsity, presenting a third paradigm for multimodal representation beyond contrastive learning/InfoMax.
- Vision-aligned Latent Reasoning for Multi-modal Large Language Model
-
This paper proposes VaLR: inserting several "latent tokens" before each CoT reasoning step in MLLMs, and aligning these tokens with patch features from visual encoders such as DINOv3/SigLIP/π³ (REPA), thereby continuously "feeding back" visual information to the model during long-chain reasoning. This approach boosts Qwen2.5-VL's accuracy on VSI-Bench from 33.0% to 52.9%, and for the first time enables MLLMs to exhibit "the longer the reasoning, the higher the accuracy" test-time scaling behavior.
- What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity (GLANCE)
-
GLANCE introduces a "think-see alignment" self-supervised head into VLM agent reinforcement learning: the LLM’s "next state prediction" in CoT is projected via a lightweight projector to the real next-frame representation encoded by an EMA target vision encoder. The prediction gap serves simultaneously as intrinsic curiosity reward, training signal for the vision encoder, and an alignment loss to ground the internalized world model. Combined with a curriculum exploration mechanism that periodically resets the projector to counteract curiosity drain, GLANCE consistently outperforms existing exploitation-only VLM-RL methods across five agentic tasks.
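
To make the batch-level reward in the deep imbalanced regression note above more concrete, here is a minimal sketch of the Concordance Correlation Coefficient in plain Python. It uses the standard definition \(2\,\mathrm{cov}(p,t) / (\sigma_p^2 + \sigma_t^2 + (\mu_p-\mu_t)^2)\). The function name `concordance_correlation`, the handling of the degenerate case, and the toy numbers are assumptions for illustration and are not taken from the paper; the GRPO integration is omitted entirely.

```python
def concordance_correlation(preds, targets):
    """Concordance Correlation Coefficient between two equal-length sequences.

    Uses population (1/n) moments. Returns a value in [-1, 1]; 1 means the
    predictions agree with the targets in correlation, spread, and mean.
    """
    n = len(preds)
    if n == 0 or n != len(targets):
        raise ValueError("preds and targets must be non-empty and equal length")

    mean_p = sum(preds) / n
    mean_t = sum(targets) / n
    var_p = sum((p - mean_p) ** 2 for p in preds) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(preds, targets)) / n

    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0.0:
        # Both sequences are constant and equal; CCC is undefined here.
        # Returning 0.0 is an arbitrary choice made for this sketch.
        return 0.0
    return 2.0 * cov / denom


# Toy long-tailed batch (hypothetical values).
targets = [1.0, 2.0, 3.0, 10.0]
perfect = list(targets)                  # identical predictions
collapsed = [4.0, 4.0, 4.0, 4.0]         # every prediction equals the target mean
shifted = [t + 5.0 for t in targets]     # right shape, wrong mean

score_perfect = concordance_correlation(perfect, targets)
score_collapsed = concordance_correlation(collapsed, targets)
score_shifted = concordance_correlation(shifted, targets)
```

On this toy batch the collapsed predictions score 0 even though their mean matches the targets, the identical predictions score 1, and the constant offset of 5 lowers the score to 0.5, which shows how a single scalar reacts to correlation, variance, and mean at once.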
🎨 Image Generation
- Adversarial Flow Models
-
The authors add an optimal transport regularization \(\|G(z)-z\|^2\) to the GAN training objective, constraining the "arbitrary transport map" of GANs to the Wasserstein-2 optimal transport map. This enables stable adversarial training and end-to-end one-step generation on pure transformers for the first time. On ImageNet-256, 1NFE FID reaches 2.38 (XL/2) and 1.94 (112 layers).
- Anomaly-Preference Image Generation (APO)
-
The authors reformulate "few-shot anomaly image generation" as a "preference optimization problem without manual annotation": real anomalies serve as positive samples, while the denoising bias of the reference model at the same timestep acts as an implicit negative sample. A DPO-style loss aligns the diffusion model to the anomaly distribution. Time-aware LoRA rank adjustment (TACA) preserves structural diversity, and hierarchical CFG controls text-anomaly alignment strength. On benchmarks like MVTec, both fidelity and diversity are improved.
- Caracal: Causal Architecture via Spectral Mixing
-
Caracal replaces the \(\mathcal{O}(L^2)\) attention in Transformers with an \(\mathcal{O}(L \log L)\) Multi-Head Fourier (MHF) module, achieving strict causal masking in the frequency domain via a "pad-FFT-multiply-iFFT-truncate" pipeline. It completely removes positional encoding, relying solely on standard FFT operators (without custom CUDA kernels like Mamba), and matches the performance of Llama / Mamba / Mamba-2 / Jamba across all model scales from Tiny to Large.
- CARD: Coarse-to-fine Autoregressive Modeling with Radix-based Decomposition for Transferable Free Energy Estimation
-
CARD uses "radix \(r\) decomposition" to bijectively map molecular 3D coordinates into a coarse-to-fine discrete-continuous mixed token sequence, enabling a cross-system transferable autoregressive Transformer to serve as a "zero free energy proposal" and directly estimate the absolute free energy of any molecular system via BAR. On solvation tasks for 70 novel systems, it matches the accuracy of classical MFES while being about 40 times faster in inference.
- CoCoEdit: Content-Consistent Image Editing via Region Regularized Reinforcement Learning
-
This work addresses the issue of "editing models making unintended changes in non-editing regions" by constructing the CoCoEdit-40K local editing dataset, introducing a pixel-level similarity reward to complement the MLLM reward, and designing a region-regularized RL objective (constraining non-editing regions for high-reward samples, forcing changes in editing regions for low-reward samples). This approach improves both editing scores and PSNR/SSIM for FLUX.1 Kontext and Qwen-Image-Edit, breaking the existing trade-off between editing capability and content consistency.
- Conditional Diffusion Sampling
-
This paper proposes Conditional Diffusion Sampling (CDS): by deriving a class of conditional stochastic interpolants, it obtains an exact closed-form SDE for the unnormalized target distribution (without neural network fitting), and then efficiently samples the initial distribution of this SDE using Parallel Tempering (PT)—combining PT's global exploration with the local refinement of the diffusion process. On 8 target distributions and 4 task types, CDS outperforms traditional MCMC, training-free MCMC, and neural samplers with fewer density evaluations.
- Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers
-
This paper uses linear probes to discover that in MM-DiT (FLUX / SD3.5), certain attention heads in intermediate layers naturally encode a binary signal in the key vectors of text tokens, indicating whether a target concept will appear. Based on this, the authors propose Omission Signal Intervention (OSI): during inference, they inject the mean difference direction between "omission" and "existence" classes into the key vectors of the top-K heads with strength \(\alpha\sigma\boldsymbol{\theta}\), thereby activating the model's "self-awareness" of missing concepts and prompting completion. On FLUX, GenEval 6-object accuracy improves from 0.18 → 0.40, without any fine-tuning.
- End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
-
EOSTok jointly trains a 1D ViT tokenizer and an autoregressive model in a single-stage end-to-end pipeline. The newly proposed APR (Autoregressive Prediction Reconstruction) loss enables gradients from "next-token prediction" to flow back to the pixel space, preventing codebook collapse. "Implicit alignment" injects DINOv2 semantics into the 1D latent space without disrupting the 1D autoregressive structure. Ultimately, EOSTok achieves an FID of 1.48 on ImageNet 256 without guidance (SOTA).
- Exploring and Exploiting Stability in Latent Flow Matching
-
This work systematically characterizes the "trajectory stability" of Latent Flow Matching (LFM)—under the same noise seed, pruning 75% of the data, changing model size, or altering training seeds still produces nearly identical images. This property is then translated into two practical algorithms: (1) Using balanced-clustering pruning, 50% of CelebA-HQ data can be pruned with a slight FID improvement, and 75% can be pruned on ImageNet; (2) A Coarse-to-Fine two-stage generation, combining DiT-XL/2 (675M) and DiT-S/2 (33M), achieves 2.15× faster inference.
- Factored Classifier-Free Guidance
-
This paper identifies an "attribute amplification" failure mode of CFG in counterfactual generation with diffusion models—using a single global \(\omega\) amplifies not only the target attribute but also unintended ones. The authors propose FCFG: grouping attributes by causal graph and assigning independent guidance weights to each group, which significantly reduces non-target attribute drift and improves counterfactual reversibility on CelebA-HQ / EMBED / MIMIC-CXR.
- GenExam: A Multidisciplinary Text-to-Image Exam
-
GenExam treats "drawing exams" as the gold standard for evaluating the comprehensive reasoning-understanding-generation abilities of T2I models. It provides 1,000 questions across 10 disciplines, each with a ground-truth image and fine-grained scoring points. Even the strongest closed-source model, Nano Banana Pro, achieves only 70.2% strict score, while most open-source T2I/unified MLLMs score below 3%.
- Implicit Preference Alignment for Human Image Animation
-
The authors propose Implicit Preference Alignment (IPA): a post-training method that requires only "good samples" and does not need to construct good/bad pairs. By maximizing the KL gap with a pretrained reference model, IPA equivalently maximizes an implicit reward. Combined with a HALO module that weights hand masks in the loss, this enables large-scale video DiT models to significantly improve hand fidelity in human animation using only 93 selected samples.
- Krause Synchronization Transformers
-
The authors introduce the Krause bounded confidence consensus model into Transformers, replacing global softmax similarity with "distance-RBF + local window + top-k sparsity." They theoretically prove this encourages multi-cluster synchronization rather than global collapse, and demonstrate superior performance and over 30% compute savings on ViT, autoregressive image generation, and LLMs.
- Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
-
This paper targets rectified flow (RF) text-to-image models and proposes PNAPO—an offline preference optimization framework that saves both the "prior noise used for generation" and "winner/loser images" as sextuplets. Leveraging the RF straight-line trajectory assumption for trajectory estimation and dynamic regularization scheduling, PNAPO outperforms Diffusion-DPO on SD3-M/FLUX while reducing training compute to 1/12.
- Riemannian Generative Decoder
-
This work addresses the challenge that Riemannian VAEs require hand-crafted, complex probability densities for each manifold. It proposes the Riemannian Generative Decoder (RGD), which entirely discards the encoder and treats each sample's latent as a free parameter, trained directly with a Riemannian optimizer (RiemannianAdam). It introduces "input noise inversely scaled by local metric" as a geometric regularizer. On three datasets (a synthetic branching diffusion tree, human mitochondrial DNA, and cell cycle scRNA-seq), RGD recovers more faithful geometry and achieves superior numerical stability over VAE baselines in high dimensions.
- SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning
-
The authors identify the "attention collapse" issue in MLLM-based editing reward models—where the model, instead of comparing the original and edited images, collapses attention onto sink tokens and makes blind judgments. They propose SpatialReward: an 8B model first predicts bounding boxes of edited regions, then uses these box tokens as anchors for interleaved cross-image reasoning. With a 260K-sample spatially-aware dataset and two-stage GRPO training, the method achieves SOTA on three reward benchmarks and boosts OmniGen2's GEdit-Bench score by +0.90 (twice the improvement of GPT-4.1).
- Speculative Coupled Decoding for Training-Free Lossless Acceleration of Autoregressive Visual Generation
-
This work identifies that the root cause of limited acceleration in Speculative Jacobi Decoding (SJD) for autoregressive visual generation is the near-zero probability of collision between draft tokens across consecutive iterations due to independent sampling. By simply replacing independent sampling with Maximal/Gumbel Coupling (a one-line modification, no extra training), image generation can be accelerated up to \(4.2\times\) and video generation up to \(13.6\times\), while strictly preserving the output distribution identical to original AR decoding.
- Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges
-
SDB reframes modality translation as "selecting a coupling from all joint distributions \(\mathcal{P}\) satisfying marginal constraints," stacking marginal matching (WTA + capacity constraint) and both endpoint-level and trajectory-level cycle consistency on top of LDDBM. Paired supervision becomes merely an optional heuristic, enabling training under zero-paired, semi-paired, and fully-paired regimes. Even with full pairing, SDB outperforms paired-only baselines (e.g., FFHQ→CelebA-HQ PSNR improves from 25.6 to 25.9).
- The Coupling Within: Flow Matching via Distilled Normalizing Flows
-
This paper proposes NFM (Normalized Flow Matching), which uses the "accurate data→noise bijection" produced by a pretrained autoregressive normalizing flow (NF) such as TarFlow as the noise-data pairing for Flow Matching. This approach simultaneously advances FM's convergence speed and low-step FID, and, in turn, achieves inference speeds several orders of magnitude faster than the NF teacher.
- Threshold-Guided Optimization for Visual Generative Models
-
The authors remove the paired preference assumption of DPO, proving that the optimal policy under KL regularization essentially compares each sample's reward to an intractable instance-dependent baseline \(\tau^*(x)=\beta\log Z(x)\). They propose replacing it with a global threshold \(\tau\) estimated from a score percentile, and introduce a confidence weight proportional to \(|s-\tau|\). This enables stable alignment of diffusion models and MaskGIT using only scalar scores (without paired preferences), consistently outperforming Diffusion-DPO / KTO / DSPO across five reward models and three test sets.
- Visual Implicit Autoregressive Modeling
-
This work embeds Deep Equilibrium (DEQ) implicit fixed-point layers into the next-scale autoregressive framework of VAR, using Jacobian-Free Backpropagation to achieve constant memory training, compressing the 2B parameters of VAR-d30 down to 770M. During inference, the number of iterations per scale becomes a "tunable knob"—on ImageNet-256, FID 2.16/sFID 8.07 are maintained, while peak memory on a single 4090 drops from 19.24GB to 8.53GB and throughput increases from 15.16 to 32.08 img/s.
- Watch Your Step: Information Injection in Diffusion Models via Shadow Timestep Embedding
-
This paper reveals that the often-overlooked "timestep embedding" in diffusion models is in fact an unused information side channel. By extending the training timestep range to a "shadow interval" and binding a different data distribution to this interval, it is possible—without changing the scheduler interface—for the same diffusion model to generate normal images in the explicit interval and "hidden" images in the shadow interval. This enables both covert backdoor attacks and model watermark verification. The paper also provides a mutual coherence theoretical analysis based on sinusoidal positional encoding, explaining why two disjoint intervals can carry independent information.
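
The Speculative Coupled Decoding note above attributes the speed-up to replacing independent sampling with a coupling. A minimal sketch of one such coupling, Gumbel coupling via shared Gumbel noise, is shown below in plain Python. All function names, the toy distributions, and the trial counts are hypothetical, and how the paper wires the coupling into Jacobi decoding is not reproduced here; the sketch only illustrates that two categorical draws that reuse the same noise agree far more often than independent draws, while each draw on its own is still a Gumbel-max sample from its distribution.

```python
import math
import random


def sample_gumbel_noise(vocab_size, rng):
    """Draw one standard Gumbel variate per vocabulary entry."""
    eps = 1e-12  # keep u strictly inside (0, 1) so both logs are finite
    noise = []
    for _ in range(vocab_size):
        u = min(max(rng.random(), eps), 1.0 - eps)
        noise.append(-math.log(-math.log(u)))
    return noise


def gumbel_argmax(probs, noise):
    """Gumbel-max trick: argmax_i (log p_i + g_i) is a sample from probs."""
    best_index, best_score = -1, -math.inf
    for i, (p, g) in enumerate(zip(probs, noise)):
        if p <= 0.0:
            continue  # zero-probability tokens can never be selected
        score = math.log(p) + g
        if score > best_score:
            best_index, best_score = i, score
    return best_index


def agreement_rate(p, q, trials, coupled, seed=0):
    """Fraction of trials in which samples from p and q pick the same token."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        noise_p = sample_gumbel_noise(len(p), rng)
        # Coupled: reuse the same noise. Independent: draw fresh noise for q.
        noise_q = noise_p if coupled else sample_gumbel_noise(len(q), rng)
        hits += gumbel_argmax(p, noise_p) == gumbel_argmax(q, noise_q)
    return hits / trials


# Two nearby next-token distributions, standing in for the draft
# distributions of consecutive iterations (toy values).
p = [0.50, 0.30, 0.20]
q = [0.45, 0.35, 0.20]
coupled_rate = agreement_rate(p, q, trials=2000, coupled=True)
independent_rate = agreement_rate(p, q, trials=2000, coupled=False)
```

With independent sampling the expected agreement for these toy distributions is \(\sum_i p_i q_i = 0.37\); sharing the noise pushes it much higher, which is the collision probability the note refers to.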
🔬 Interpretability
- All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs
-
This paper systematically falsifies a core implicit assumption in mechanistic interpretability—"one LLM capability corresponds to a unique circuit"—using the Overlap-Aware Sheaf Repulsion (OASR) algorithm. It finds that the same task can be supported by multiple, nearly non-overlapping (IoU ~4–11%) but all faithful/sparse/complete circuits or sheaves, and proposes the "Distributive Dense Circuit Hypothesis" as a theoretical explanation.
- Barriers to Counterfactual Credit Attribution for Autoregressive Models
-
This paper formalizes the problem of "Counterfactual Credit Attribution (CCA)" for generative models in RAG/in-context deployment, and proves two surprising negative results: (1) even if the underlying next-token predictor is (0,0)-CCA, the autoregressive rollout is not CCA, so CCA does not compose under autoregression the way DP does; (2) black-box "CCA retrofitting" of a deployed non-attributing model requires a number of queries at least exponential in the output length \(\ell\).
- Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path
-
This paper proposes the Circuit Fingerprint hypothesis: when an answer token is fed into a Transformer in isolation, the direction it leaves in the hidden space precisely corresponds to the circuit path required to produce that answer. Based on this, circuit discovery can be achieved via pure geometric alignment (without gradients/interventions), and the same set of directions can be used for activation steering, demonstrating that "read" and "write" are two sides of the same geometric object.
- CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features
-
By selecting interpretable steering features whose SAE activations on generated tokens are Pearson-correlated with task correctness, and directly using the mean activation on positive samples as the coefficient—without needing contrastive datasets or backpropagation—CorrSteer achieves +3.3% on MMLU and +27.1% on HarmBench for Gemma-2 2B / LLaMA-3.1 8B, with lower side effect rates than fine-tuning.
- Disentangling Direction and Magnitude in Transformer Representations: A Double Dissociation Through L2-Matched Perturbation Analysis
-
This paper uses an L2-matched perturbation protocol to demonstrate that, in the Pythia series, direction (angle) perturbations are 42.9 times more destructive to language modeling loss than magnitude perturbations of the same displacement, while magnitude perturbations are far more damaging to syntax (subject-verb agreement) than direction perturbations. This constitutes a "double dissociation" in the cognitive neuroscience sense, with direction effects propagating via the attention pathway and magnitude effects via the LayerNorm pathway.
- Do Activation Verbalization Methods Convey Privileged Information?
-
This work systematically demonstrates that current popular activation verbalization methods (Patchscopes / LIT / SelfIE), when used as LLM interpretability tools, have their performance fully explained by the "verbalizer model's own knowledge," without requiring any internal activations from the target model. This implies that these tools only appear to work on existing benchmarks due to flaws in benchmark design, and when the verbalizer's knowledge exceeds that of the target, it fabricates "explanations" the target does not possess.
- SemGrad: Gradients w.r.t. Semantics-Preserving Embeddings Tell LLM Uncertainty
-
SemGrad is the first to bring "gradient-based" uncertainty quantification to LLM free-form generation. It uses the Semantics-Preserving Score (SPS) to identify hidden states encoding input semantics, and treats the norm of the log-likelihood gradient with respect to these states as a measure of LLM confidence. Without sampling and with only a single backward pass, it outperforms 11 SOTA baselines on 3 QA datasets, especially surpassing SAR by 3.27 AUROC on the multi-answer TruthfulQA.
- Grokking: From Abstraction to Intelligence
-
This paper provides a unified explanation of the grokking phenomenon from the perspective of structural simplification (Occam's razor): during training, the model undergoes four types of "internal condensation" that occur synchronously—causal mediation degradation, manifold collapse to the \(\mathbb{Z}_{97}\) ring, spectral energy concentrating on sparse Fourier modes, and a sharp drop in BDM algorithmic complexity. Using an analytically tractable singular feature machine (SFM), it is shown that these are equivalent to a free energy-driven phase transition.
- Interpretability Can Be Actionable
-
This is a position paper arguing that "what interpretability research lacks is not new methods, but evaluation criteria": research should use actionability (whether insights can drive concrete decisions/interventions outside the interpretability domain) as a core evaluation dimension. The authors define actionability along the axes of concreteness and validation, analyze obstacles, list five high-leverage application domains, and provide a six-step checklist for researchers.
- Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models
-
The authors conduct the first large-scale hierarchical mechanistic analysis of six mainstream tabular foundation models (TFMs), discovering that the middle and later layers mainly perform "iterative refinement" and contain substantial redundancy. Based on this, they design a single-layer recurrent TFM using only 20% of the parameters, achieving performance nearly matching the original six-layer version.
- Manifold-Aligned Guided Integrated Gradients for Reliable Feature Attribution
-
This paper proposes MA-GIG, which transfers the "select low-gradient features and take a step" strategy of Guided IG from pixel space to the latent space of a pretrained VAE. By leveraging the decoder Jacobian, axis-aligned updates in latent space are mapped into updates within the tangent space of the data manifold. This both avoids high-gradient noise regions and keeps samples along the integration path close to the true data manifold, resulting in more reliable attributions.
- Memory as a Markov Matrix: Sample Efficient Knowledge Expansion via Token-to-Dictionary Mapping
-
This paper interprets the next-token distribution of an autoregressive LLM as the state transition matrix of a Markov chain, so "learning new words" becomes "adding new states to the state space and representing them as sparse combinations of existing states." Theoretically, only \(O(s)\) samples are needed (where \(s\) is the number of mapped old tokens); in practice, simply fine-tuning the embedding of the new token suffices to achieve cross-lingual and new-concept expansion with strict zero forgetting.
- Optimal Attention Temperature Improves the Robustness of In-Context Learning under Distribution Shift in High Dimensions
-
Within the high-dimensional linear regression ICL framework, this work employs an "approximate softmax attention" that preserves softmax normalization and temperature selectivity while remaining analytically tractable, deriving a closed-form solution for ICL generalization error and an explicit formula for the optimal attention temperature \(\tau_{\text{opt}}\). It is proven that simply tuning the temperature at inference can recover near Bayes-optimal performance. The effectiveness of this "lightweight knob" is also validated on real QA tasks with GPT-2 and Llama2-7B.
- Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks
-
The authors formalize neural networks (especially LLMs) as composite agents synthesized from multiple implicit sub-agents (each a probability distribution over outcomes) via log-weighted pooling. Within the cognitive utility framework \(W_i(o)=\log P_i(o)\), they prove that "strict unanimity benefit" is impossible under linear pooling or binary outcomes, but feasible when \(|\mathcal O|\ge 3\). This leads to the alignment principle that "explicitly manifesting Waluigi before suppression" is strictly superior to "only reinforcing Luigi".
- Provably Learning Attention with Queries
-
The authors prove that single-head softmax attention can be exactly recovered with remarkable simplicity under value-query access—requiring only \(O(d^2)\) queries, which is much easier than for ReLU MLPs of similar structure. When the head dimension \(r\ll d\), compressed sensing reduces this to \(O(rd)\). The results extend to noisy oracles, membership queries, and the unidentifiability of multi-head attention.
- Steer Like the LLM: Activation Steering that Mimics Prompting
-
This paper reinterprets "prompt steering" as a form of activation steering natively implemented by LLMs, and then distills the activation difference induced by prompt injection using a token-wise ReLU probe. The resulting Prompt Steering Replacement (PSR) module not only outperforms existing activation steering methods (CAA, ReFT-R1, Stolfo, etc.) on three steering benchmarks, but also matches or surpasses prompting on AxBench and persona steering tasks.
- The Cylindrical Representation Hypothesis for Language Model Steering
-
This paper proposes the Cylindrical Representation Hypothesis (CRH), which relaxes the orthogonality assumption of the LRH while retaining "concept linearity." It demonstrates that the superposition of concept vectors naturally induces a cylindrical geometry of "axis + normal plane + sensitive sector," thereby providing the first geometric explanation for why activation steering is unpredictable at the sample level but observable at the population level.
- The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
-
This paper reveals the structural root of "attention sink to the first token" in LLMs—under causal masking, the first token lacks value aggregation, leading to variance discrepancy, which is selectively amplified by super neurons in the FFN, resulting in extreme dimension disparity. This ultimately locks the QK projection, forcing the formation of an attention sink. Based on this, the authors propose head-wise RMSNorm during pretraining to fundamentally suppress the sink.
- Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions
-
The authors leverage infinite-width neural network scaling theory to derive that joint training of the steering vector's scaling factor and direction should satisfy the scaling constraint \(\eta_{\mathbf{v}}\eta_{\alpha}=\Theta(1)\), thereby eliminating the need to select \(\alpha\) manually at inference. Inspired by ReFT, they apply additive intervention only to the first 4 prompt tokens (PrOSV). On AxBench, this approach maintains model utility and consistently outperforms full-sequence FSSV across three Gemma2/Qwen2.5 model scales.
- Understanding LoRA as Knowledge Memory: An Empirical Analysis
-
The authors conduct a systematic empirical audit using the PhoneBook and the newly constructed PaperQA benchmarks, treating LoRA modules as independently trainable, loadable, and combinable knowledge memory units. They derive quantitative end-to-end design guidelines covering "rank → capacity → efficiency → multi-module composition → complementarity with RAG/ICL".
- Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints
-
This paper provides an architectural-level explanation for "why the internal representations of transformers can be repeatedly and successfully decoded by simple linear methods (probe, SAE, activation steering)": as long as semantic features are read out via linear interfaces such as OV circuits or unembedding, they must reside in a context-invariant linear subspace (Invariant Subspace Necessity theorem); this leads to a zero-shot application—the Self-Reference Property, i.e., the embedding direction of a token itself is its concept direction, enabling unsupervised classification directly using the geometric position of class tokens.
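The Self-Reference Property mentioned in the last note can be sketched as a nearest-direction classifier: a representation is assigned to the class whose token embedding it aligns with best. The toy embeddings, dimensions, and the function name `classify_by_class_tokens` below are assumptions for illustration; a real use would read the class-token rows from a model's embedding matrix, and the paper's exact procedure may differ.

```python
import numpy as np

def classify_by_class_tokens(hidden, class_embeddings):
    """Return the index of the class whose token-embedding direction
    has the highest cosine similarity with the hidden representation."""
    h = hidden / np.linalg.norm(hidden)
    e = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    return int(np.argmax(e @ h))

rng = np.random.default_rng(0)
dim = 16

# Toy stand-ins for the embeddings of three class tokens (hypothetical data).
class_embeddings = rng.normal(size=(3, dim))

# A representation that mostly lies along the direction of class 1, plus noise.
hidden = 2.0 * class_embeddings[1] + 0.1 * rng.normal(size=dim)

predicted = classify_by_class_tokens(hidden, class_embeddings)
print("predicted class:", predicted)
```

No labels or probe training are involved: the class directions come directly from the token embeddings, which is what makes the application zero-shot.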
📦 Model Compression
- ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin
-
The authors diagnose the root cause of codebook collapse in VQ-VAE as "codebook vector \(\ell_2\) norm imbalance + geometric clustering." They propose SAMP: Ball-Bounded Norm Regularization constrains all codebook vectors within a time-varying Euclidean ball, and ArcCosine Additive Margin Loss, inspired by ArcFace, pushes latent vectors apart on the sphere. This leads to uniform codebook dispersion and a significant increase in utilization, outperforming mainstream VQ-VAE variants on ImageNet reconstruction and generation FID.
- Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression
-
To address the MoE LLM "load imbalance–parameter redundancy–communication overhead" trilemma, this paper proposes a unified framework: experts are grouped online via dual "parameter + activation" similarity clustering; within each group, "shared base matrix + low-rank residual" structured compression (~5×) is applied; then, a two-level hierarchical routing ("group selection then expert selection") is performed, combined with FP16/INT4 heterogeneous precision and offline unloading of idle groups. On GLUE/WikiText-103, this achieves about 80% parameter reduction, 10–20% throughput improvement, and a 3× reduction in expert load variance, while matching standard MoE performance.
- Demystifying When Pruning Works via Representation Hierarchies
-
Starting from the three-stage representation hierarchy "embedding → logit → probability," this paper uses Taylor local expansion theory to prove: pruning introduces inherently small perturbations in the embedding and logit spaces, but the nonlinear softmax step amplifies these perturbations into the probability space by a factor of \(\mathrm{Var}_r(\Delta z)/(2T^2)\). Through stepwise accumulation in autoregressive decoding, this ultimately leads to catastrophic failure in generation tasks. In contrast, non-generation tasks are naturally robust to pruning since they only depend on a candidate token subspace—this unifies the explanation for why pruning is nearly lossless on MMLU and retrieval but drops to zero on GSM8K and HumanEval.
- Dispersion Loss Counteracts Embedding Condensation and Improves Generalization in Small Language Models
-
This work systematically observes the prevalent phenomenon of "token embeddings in small language models condensing into a narrow cone with depth" (embedding condensation)—a phenomenon not seen in large models—and proposes an angular dispersion loss \(\mathcal{L}_{\text{disp}}\) that directly encourages embedding dispersion. Without introducing extra parameters, this loss yields an average improvement of 3.3% on 10 benchmarks for Qwen3 / GPT2.
- Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation
-
This paper proposes TAD (Tail-Aware Distillation): by explicitly separating the teacher's top-\(K\) probabilities from the "tail" probabilities in the standard KD KL divergence and amplifying the tail's contribution, it enables LLM pretraining distillation within academic-level compute (single H100 + 1 week), achieving average performance superior to data-centric methods like MiniPLM.
- FedRot-LoRA: Mitigating Rotational Misalignment in Federated LoRA
-
This paper identifies that the true "enemy" of naive factor-wise averaging in federated LoRA is the latent subspace misalignment caused by rotational invariance. It proposes that each client solves for a rotation matrix \(R_i^t\) via orthogonal Procrustes to align \(A,B\) factors before aggregation. Both theoretical and experimental results demonstrate significant reduction in aggregation error without increasing communication overhead.
- FlattenGPT: Depth Compression for Transformer with Layer Flattening
-
This paper proposes FlattenGPT, which first "flattens" and merges adjacent transformer layers in LLMs with highly similar inputs into a single layer with 2× width (retaining all parameter knowledge), then applies channel pruning to the merged layer to restore the original width—thus achieving inference acceleration via depth compression while avoiding the catastrophic performance drop from directly discarding entire layers as in traditional pruning.
- From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers
-
The authors use a three-perspective analysis—sample-wise SVD, dataset-level PCA, and token-level Spectral Energy Pattern (SEP)—to reveal a seemingly paradoxical geometry in ViT representations: "Each image's feature matrix is low-rank, but the cross-image shared subspace is nearly full-rank, and the spectral bandwidth of single tokens approaches 100%." They then propose two minimal patches, Lift (retaining the lifting projector at inference) and WideLast (widening only the last block to teacher width), which enable plain MSE feature distillation to boost DeiT-Tiny ← CaiT-S24 from 74.86% to 78.23%.
- Linearizing Vision Transformer with Test-Time Training
-
The authors observe that the inner model of two-layer TTT is structurally equivalent to Softmax attention (Softmax can be viewed as a two-layer dynamic MLP). This enables direct inheritance of all Q/K/V/MLP weights. Key Instance Normalization is used to handle shift-invariance, and depthwise conv on Q/K is added to inject locality. With only 1 hour of fine-tuning, Stable Diffusion 3.5 is linearized and accelerated by 1.32×–1.47×.
- OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
-
OSAQ leverages the observation that the Hessian of each LLM layer maintains a consistent low-rank null space across different inputs. By linearly combining the null space vectors into an additive weight perturbation \(\Delta W\), OSAQ "self-absorbs" outlier weights without altering the second-order task loss, reducing the perplexity of 2-bit weight-only quantization by over 40% compared to naive GPTQ.
- Preserve-Then-Quantize: Balancing Rank Budgets for Quantization Error Reconstruction in LLMs
-
The authors propose SRR (Structured Residual Reconstruction), which explicitly splits the fixed low-rank budget \(r\) in QER (Quantization Error Reconstruction) into two parts: "preserve the top \(k\) principal singular directions before quantization" and "use the remaining \(r-k\) ranks to fit the residual." They provide a closed-form criterion requiring only a single random probe to select \(k^\star\) per layer, consistently outperforming LQER/QERA in 2/3-bit PTQ and QPEFT.
- Proxy Compression for Language Modeling
-
The authors propose "proxy compression": during training, 90% of data is fed as short sequences produced by a tokenizer or neural compressor, and 10% as raw UTF-8 bytes, combined with sentinel tokens and a brief in-context translation warm-up. At inference, all compressors are discarded and the model sees only raw bytes, yet it significantly outperforms pure byte models under fixed compute, and at scale matches or surpasses tokenizer baselines.
- Resting Neurons, Active Insights: Robustify Activation Sparsity for Large Language Models
-
This paper attributes the performance drop in LLMs caused by activation sparsity to "representation drift." Inspired by biological spontaneous firing, it injects a small, input-independent vector (SPON) into each layer, which can be absorbed into the bias after training. This approach significantly narrows the gap between sparse and dense models with nearly zero inference overhead.
- RQ-MoE: Residual Quantization via Mixture of Experts for Efficient Input-Dependent Vector Compression
-
RQ-MoE employs a "two-level MoE + dual-stream quantization" design, enabling the codebook for residual vector quantization (RQ) to be dynamically generated per input, and achieves 6–14× decoding acceleration by decoupling the instruction and reconstruction streams. On four retrieval benchmarks, it matches or surpasses QINCo in MSE/Recall.
- ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning
-
The authors prove that LoRA's cumulative updates are trapped in a fixed low-rank subspace and propose ScaLoRA: at each step, after merging the old \(AB^\top\) into \(W^{pt}\), the adapter is restarted with an analytically optimal "column scaling", enabling AdamW first/second moments to be transferred equivariantly in \(O((m+n)r)\) time (no reset/warm-up needed), and cumulative updates naturally become high-rank. ScaLoRA consistently outperforms LoRA / MoRA / HiRA / ReLoRA / LoRA-GA on DeBERTaV3, LLaMA2-7B, LLaMA3-8B, and Gemma3-12B.
- Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression
-
This paper first uses the new benchmark system KVFundaBench to reveal a key asymmetry: "retrieval-type long-context tasks can be compressed, reasoning-type cannot." The root cause is attributed to KV compression breaking the integrity of few-shot examples as "semantic units." Based on this, ShotKV is proposed—preserving each shot as an indivisible unit during prefill, and applying dynamic token-level compression during decoding. This allows LG-GSM8K to improve from a baseline of 46.0 to 47.33 at a 40% compression rate, and reduces end-to-end latency by 11.3% under long-input settings.
- Stochastic Sparse Attention for Memory-Bound Inference
-
SANTA treats attention value aggregation \(AV\) as "weighted sum of value rows \(V\) by softmax probabilities \(A\)," and replaces it with an unbiased estimator: "sample \(S\ll n_k\) indices from \(A\) without replacement and directly average the corresponding \(V\) rows." Stratified/systematic sampling is used to reduce variance, and the method is implemented as a GPU kernel aligned with FlashDecoding. On 32k context, it achieves 1.5× end-to-end speedup over FlashInfer/FlashDecoding without loss of accuracy.
- SURGE: Surrogate Gradient Adaptation in Binary Neural Networks
-
SURGE attaches a "full-precision auxiliary branch" in parallel to each binarized layer. The forward output remains unchanged, but in the backward pass, an extra "non-STE truncated" higher-order gradient is backpropagated from the full-precision branch. AGS dynamically balances the contributions of both paths according to the gradient norm ratio, enabling BNNs to achieve 62.0% top-1 on ResNet-18/ImageNet—1.0% higher than ReCU and 3.9% higher than IR-Net.
- Task-Driven Subspace Decomposition for Knowledge Sharing and Isolation in LoRA-based Continual Learning
-
LoDA decomposes the LoRA down-projection matrix by "projection energy" into a general subspace shared across tasks and an isolated subspace that is only activated by new tasks. It then uses gradient alignment to train the up-projection and applies a closed-form recalibration to the general branch during merging, thereby consistently outperforming existing LoRA-CL methods on multiple continual learning benchmarks.
- Test-Time Training with KV Binding Is Secretly Linear Attention
-
This paper uses four "memory paradox" counterexamples and a set of rigorous unrolling theorems to prove that TTT with KV-binding inner loops (e.g., LaCT, ViTTT), even with multi-layer MLPs and momentum, is essentially "learned linear attention operators." Based on this, the authors simplify and parallelize it into standard linear attention, achieving a 4× throughput boost with almost no performance drop.
- Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection
-
The authors observe that the "importance" of tokens varies drastically across layers and heads, so traditional token eviction, which removes tokens in one shot, commits to an irreversible early decision that may be wrong. They propose Token Sparse Attention, where each attention head in each layer independently selects \(L' \ll L\) tokens for dense attention, then scatters the output back to the original sequence length, with a residual path allowing skipped tokens to be reconsidered in the next layer. This preserves both head/layer-level dynamic selection and compatibility with dense kernels like FlashAttention. Combined with FlexPrefill, it achieves a 3.23× attention speedup with <1% accuracy loss on 128K context.
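The select, attend, and scatter steps described for Token Sparse Attention can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the per-head importance `scores` are a hypothetical input (the note does not say how tokens are ranked), attention is non-causal, and the output projection is omitted.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def token_sparse_attention(x, q, k, v, scores, keep):
    """x: (L, H*D) residual stream; q, k, v: (H, L, D);
    scores: (H, L) hypothetical per-head token importance; keep: L'."""
    H, L, D = q.shape
    out = np.zeros((H, L, D))
    for h in range(H):
        # Each head independently picks its own top-`keep` tokens.
        idx = np.sort(np.argsort(scores[h])[-keep:])
        # Dense attention restricted to the selected tokens.
        attn = softmax(q[h, idx] @ k[h, idx].T / np.sqrt(D))
        # Scatter the result back to the full sequence length.
        out[h, idx] = attn @ v[h, idx]
    merged = out.transpose(1, 0, 2).reshape(L, H * D)
    # Residual path: tokens skipped by a head pass through unchanged.
    return x + merged

rng = np.random.default_rng(0)
H, L, D, keep = 2, 8, 4, 3
x = rng.normal(size=(L, H * D))
q, k, v = (rng.normal(size=(H, L, D)) for _ in range(3))
scores = rng.random(size=(H, L))
scores[:, 0] = -1.0  # force token 0 to be skipped by every head
y = token_sparse_attention(x, q, k, v, scores, keep)
```

Because the selected sub-sequence is attended densely, the inner computation could be handed to an ordinary dense kernel, and a token skipped at this layer (token 0 here) is still present in `y` for the next layer to reconsider.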
💡 LLM Reasoning
- A Formal Comparison Between Chain of Thought and Latent Thought
-
Starting from computational complexity theory, this paper formally compares the expressive power of CoT (Chain of Thought) and Latent Thought (Looped Transformer / Coconut). It proves that Latent Thought strictly achieves \(\mathsf{TC}^k\) at polylogarithmic depth, while CoT reaches at most \(\mathsf{TC}^{k-1}\). Additionally, under probabilistic settings, it is shown for the first time that CoT can support FPRAS counting via random decoding, thereby surpassing deterministic Latent Thought.
- ANCHOR: Abductive Network Construction with Hierarchical Orchestration for Reliable Probability Inference in Large Language Models
-
ANCHOR employs "bottom-up abduction + hierarchical clustering" to construct a dense factor space, retrieves a sparse set of relevant factors for downstream conditions via coarse-to-fine search, and aggregates posteriors using both Naïve Bayes and a latent-variable causal Bayesian network constructed on-the-fly by an LLM. This approach significantly reduces "unknown" predictions and improves probability calibration in high-risk LLM decision scenarios.
- Automated Formal Proofs of Combinatorial Identities via Wilf–Zeilberger Guidance and LLMs
-
WZ-LLM compiles the classical Wilf–Zeilberger symbolic proof process into executable proof skeletons in Lean 4 (recurrence + boundary conditions + side conditions), which are then discharged item by item by a WZ-Prover trained via SFT + expert-iteration + DAPO. On 100 classical combinatorial identities, pass@32 improves from Goedel-Prover-V2's 9% to 34%.
- Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning
-
To address the issue where "fixed block size" in diffusion language models (dLLM) during semi-autoregressive generation disrupts the logical chain of reasoning, this paper proposes b1: using RL to learn a block-ending indicator token for generating dynamic-length blocks, and introducing a "block-level monotonic entropy descent (MED) reward" to drive coherent reasoning. This reward can be plugged into existing dLLM RL frameworks (Diffu-GRPO/GDPO/d1/wd1) as a plug-and-play component, boosting wd1 on Countdown from 39.45 to 58.98.
- Conformal Thinking: Risk Control for Reasoning on a Compute Budget
-
This work reframes the question of "when to stop reasoning in LLMs" from an opaque threshold-tuning task into a conformal risk control problem with a user-specified risk tolerance. It uses two thresholds: an upper threshold that stops when the model is confident (controlling false positives), and a newly proposed parameterized lower threshold that forces a stop when the model is "stuck" on unsolvable problems (controlling false negatives). The UCB algorithm automatically selects thresholds that satisfy the risk constraints from a calibration set, achieving "almost no drop in accuracy, significant token savings" on AIME / GPQA / MathVision.
- Efficient Reasoning with Hidden Thinking
-
Heima distills each stage (summary / caption / reasoning) of a multimodal LLM’s lengthy CoT into a special thinking token, enabling the model to "think" in latent space. The number of tokens drops from 100-200 to 13-16, while zero-shot accuracy is more stable than LLaVA-CoT. An auxiliary LLM "interpreter" is trained to reconstruct the textual reasoning chain from the thinking token’s hidden state, empirically verifying the information-theoretic upper bound of compression loss.
- Entropy-informed Decoding: Adaptive Information-Driven Branching
-
EDEN (Entropy-informed DEcodiNg) sets the beam width \(B_t\) at each step as a monotonically increasing function of the normalized entropy \(\bar H_t\): high-entropy steps fork more branches, while low-entropy steps approach greedy decoding. It thus approximates a wider beam search with fewer total expansions. Theoretically, it is proven that entropy-monotonic branching factors are strictly superior to any fixed beam width in terms of expected cumulative regret, with an explicit regret rate \(\mathbb{E}[R_T] \leq G P_\max \sum_t \exp(-c m_t \Delta_\min^2)\).
- ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
-
ETS samples directly from the closed-form optimal solution of the KL-regularized RLHF objective, expressing it as a "reference policy × conditional expectation of exponential reward (energy term)", and then uses Monte Carlo + self-normalized importance sampling at test time to approximate this energy term. This achieves, or even surpasses, the performance of RL-trained policies without any training, and leverages lightweight proposals + Fast-dLLM to keep latency within practical bounds.
- Express Your Doubts: Probabilistic World Modeling Should Not Be Based on Token logprobs
-
This position paper argues that using the token softmax probabilities (logprob) of LLMs as "world event probabilities" is theoretically incorrect—because distribution estimation, response prediction, and target distribution estimation are three distinct tasks, each with a different ideal output distribution. The correct approach to obtaining world probabilities is second-order prediction—having the LLM explicitly write out its probability estimate for an event (numerically or with linguistic hedges) in its output, rather than computing "the probability it says X".
- Game of Thought: Robust Information Seeking with Large Language Models Using Game Theory
-
This paper models LLM active questioning scenarios (20 Questions / medical diagnosis / troubleshooting) as a two-player zero-sum extensive-form game (EFG), and proposes Game of Thought (GoT): using depth-limited subgame construction + CFR to compute Nash equilibria, thereby generating "randomized questioning strategies" that significantly reduce worst-case interaction rounds across all datasets, with a 15–40% improvement over UoT under the weighted variant.
- GRPO is Secretly a Process Reward Model
-
This paper theoretically proves that GRPO + ORM, under the mild condition of "intra-group trajectory shared prefixes," is equivalent to a process reward RL objective with Monte-Carlo PRM, thereby revealing a hidden bug in vanilla GRPO—uneven prefix lengths cause most tokens in high-reward trajectories to receive negative advantage. The authors propose \(\lambda\)-GRPO, which performs PRM-aware normalization, consistently outperforming GRPO on reasoning benchmarks and achieving about 2× faster training.
- Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal
-
A simple logistic regression probe on LLM hidden states during chain-of-thought (CoT) generation can predict whether the entire reasoning will be incorrect with 0.95 AUROC (0.79 from the first step), while a classifier trained on surface text achieves only 0.59; unfortunately, all four intervention methods (activation steering, probe-guided best-of-N, self-correction, activation patching) fail—this error signal is "diagnostic" rather than "causal."
- Lifting Traces to Logic: Programmatic Skill Induction with Neuro-Symbolic Learning for Long-Horizon Agentic Tasks
-
NSI "lifts" LLM agent interaction traces into neuro-symbolic workflow graphs with explicit conditional branches and dynamic variable binding, evolving skills from stateless scripts into state-aware logical programs. It achieves success rates of 98.0 / 76.5 / 95.2 on ALFWorld / WebShop / TextCraft, comprehensively outperforming programmatic skill baselines like ASI and AWM.
- Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
-
This paper systematically reveals that the "rules of thumb" for many-shot ICL in non-reasoning tasks completely fail in CoT reasoning tasks—similarity retrieval is actually harmful, and order sensitivity increases with the number of shots. The paper reinterprets successful many-shot CoT as "in-context test-time learning," and proposes the CDS method, which sorts demonstrations by embedding trajectory curvature, achieving a 5.42 pp improvement on 64-shot geometry problems.
- Multimodal Fact-Level Attribution for Verifiable Reasoning
-
MURGAT is the first benchmark to evaluate MLLMs’ ability to provide "fact-level, modality+timestamp precise citations" in multimodal reasoning outputs. It introduces a three-step evaluation protocol (verifiable claim identification → atomic fact decomposition → attribution quality) and a highly human-aligned automatic evaluator, MURGAT-SCORE (Pearson 0.84). The study reveals that even strong models often cite incorrectly despite correct answers, and that strong reasoning often comes at the expense of verifiable citation.
- Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models
-
The authors decompose the problem of "efficient test-time scaling for discrete diffusion language models (dLLM)" into three components: allocating computation along a hierarchical timeline of "exploration → progressive pruning → refinement" (HTS), using partial remask for local branching to preserve high-confidence "logic skeletons," and treating the dLLM itself as a Yes/No verifier (SVF). Ultimately, on four math/code benchmarks and three dLLMs, they achieve comparable or better accuracy than best-of-\(N\) with far fewer NFE.
- Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training
-
This work provides the first rigorous sample complexity proof for "easy-to-hard" curriculum RL post-training: on the state-conditional autoregressive reasoning tree of a transformer, if the curriculum ensures that the difficulty ratio between adjacent stages is at most the \(L/p\)-th root of the target difficulty, then the total sample count can be reduced from the exponential \((C^\star)^L\) of direct training to the polynomial \(L\cdot (C^\star)^{p_\max}\) of curriculum-based training.
- ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
-
ResRL theoretically decomposes the "negative sample gradient polluting positive sample" phenomenon (Lazy Likelihood Displacement) in RLVR into two components: "logit × representation." It then applies a projection residual at the representation layer using the SVD low-rank subspace of positive samples, assigning each negative token a gradient weight in \([\xi,1]\) based on its "orthogonal component energy"—the more similar the representation to positive samples (smaller residual), the lighter the penalty; only purely erroneous components are heavily penalized. This preserves Pass@1 while maintaining Pass@k diversity. On Qwen3-4B math tasks, Avg@16 improves by 9.4% and Pass@128 by 7.0% over NSR.
- ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
-
The authors translate the human-annotated solution steps of the MATH dataset into "reusable Python tools with descriptions and type signatures," constructing the ToolMATH benchmark with 8K problems and 12K tools. It covers long-horizon multi-tool composition (hop 1-8+), controllable distractor tool similarity (5 levels × 4 densities), and scenarios where all gold tools are removed. Evaluation shows that the dominant failure factor is not tool selection but reasoning itself: thought errors account for over 90% of failures, and distractor tools amplify early minor deviations into irreversible execution drift.
- Unlocking Zero-Shot Geospatial Reasoning via Indirect Rewards
-
The authors use "whether a ground street view and a satellite image can be localized to the same coordinate" as a verifiable indirect reward, and apply two-stage post-training (CoT scaffolding + RL self-exploring) with GRPO to Qwen2.5-VL-7B. This enables the model to learn general reasoning abilities that can zero-shot transfer to 25+ geospatial tasks using only GPS metadata.
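The token weighting described in the ResRL entry above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes an orthonormal basis of the positive-sample subspace is already available (for example from a truncated SVD), and both the linear map from residual energy to a weight in \([\xi, 1]\) and the names `residual_energy_ratio` and `negative_token_weight` are assumptions made for this example.

```python
from typing import Sequence


def residual_energy_ratio(h: Sequence[float], basis: Sequence[Sequence[float]]) -> float:
    """Fraction of the squared norm of h that lies outside span(basis).

    `basis` is assumed to hold orthonormal vectors (a hypothetical input,
    e.g. top singular directions of positive-sample representations).
    """
    total = sum(x * x for x in h)
    if total == 0.0:
        return 0.0
    # With an orthonormal basis, projected energy is the sum of squared coefficients.
    projected = sum(sum(x * b for x, b in zip(h, vec)) ** 2 for vec in basis)
    return max(0.0, min(1.0, 1.0 - projected / total))


def negative_token_weight(
    h: Sequence[float], basis: Sequence[Sequence[float]], xi: float = 0.1
) -> float:
    """Map residual energy to a weight in [xi, 1]; the linear map is an assumption."""
    return xi + (1.0 - xi) * residual_energy_ratio(h, basis)
```

Under these assumptions, a negative token whose representation lies entirely inside the positive subspace receives the floor weight `xi`, while one orthogonal to the subspace receives the full weight 1.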
🎮 Reinforcement Learning
- CAMEL: Confidence-Gated Reflection for Reward Modeling
-
This paper observes that the log-probability margin of the verdict token is highly correlated with judgment accuracy. Based on this, CAMEL is proposed: it first makes a quick preference judgment using a single token, only triggering reflective generation when confidence is low. Counterfactual prefix augmentation is used to enhance GRPO training for self-correction. On three reward model benchmarks, a 14B parameter model achieves an average accuracy of 82.9% (surpassing the previous best 70B model by 3.2%).
- CPMöbius: Iterative Coach–Player Reasoning for Data-Free Reinforcement Learning
-
CPMöbius transforms self-play from "adversarial" to "collaborative": the Coach generates problems, the Player solves them, and the Coach receives a reward equal to "Player improvement × Player solve rate." Without any external training data, Qwen2.5-Math-7B-Instruct improves by +4.9 on average and +5.4 on OOD tasks across six math benchmarks, surpassing existing unsupervised methods like RENT/R-Zero.
- DR.Q: Debiased Model-based Representations for Sample-efficient Continuous Control
-
DR.Q builds upon the MR.Q "model-based representation + actor-critic" framework with two key additions: (1) it explicitly maximizes the mutual information between \(z_{sa}\) and the next state representation \(z_{s'}\) using InfoNCE; (2) it introduces "faded prioritized replay," a fusion of "PER × forget," to mitigate overfitting to early experiences. With a single hyperparameter set, DR.Q outperforms strong baselines such as SimbaV2, MR.Q, and TDMPC2 across 73 continuous control tasks.
- EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding
-
EARL employs a "coarse analysis–fine response" two-stage MLLM framework to unify egocentric interaction understanding tasks (description + QA + pixel mask) into a single pipeline: the first stage outputs a global description of the entire image and uses the last hidden state as a semantic prior, which is then injected into the second stage via a novel Analysis-guided Feature Synthesizer. Joint training is performed using GRPO and three types of rewards (format/answer/grounding accuracy). On Ego-IRGBench, EARL surpasses Seg-Zero in cIoU by 8.37%.
- From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents
-
To address two major bottlenecks in post-training "multi-turn interactive tool-using agents"—the high cost of quality data and RL signal corruption from user simulation noise—the authors propose "self-evolving multi-agent data synthesis (AReaL-SEA)" paired with executable verifiers as rewards. Combined with an RL recipe of "first SFT the user model, then large batch + dynamic filtering GRPO," this approach pushes Qwen3-235B to Airline 73.0 / Telecom 98.3 pass^1 on τ²-bench, matching or surpassing Claude/Gemini/GPT-5 across the board.
- How Reasoning Evolves from Post-Training Data: An Empirical Study Using Chess
-
The authors use "training LLMs to play chess" as a clean, verifiable RL testbed, systematically comparing the impact of six custom SFT datasets on RL. They find that "directly predicting the best move" achieves the highest scores but leads to unfaithful reasoning after RL, while "predicting the best line" yields comparable performance but more stable and faithful reasoning post-RL. Three metrics are distilled for predicting RL end performance from SFT checkpoints. Ultimately, a 7B model surpasses gpt-oss-120b on multiple chess benchmarks.
- Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism
-
This work challenges the mainstream consensus that "offline RL must be explicitly conservative," and proposes Neubay: adopting a Bayesian perspective on the posterior model ensemble, using long-horizon rollouts (hundreds of steps) to naturally absorb value overestimation, and controlling compounding error via layer norm and uncertainty thresholds. As a result, Neubay matches SOTA conservative algorithms on 33 D4RL/NeoRL datasets without pessimistic penalties, and sets new records on 7 datasets.
- Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication
-
SeqComm-DFL treats "multi-agent communication" as a predictor and "joint policy selection" as a downstream optimizer. By employing value-aware message generation, Stackelberg sequential conditioning, and implicit differentiation for bilevel optimization, it directly aligns communication learning with team return. This approach achieves a 4-6x cumulative reward improvement and over 13 percentage points increase in win rate on hospital scheduling and SMAC benchmarks.
- Path-Coupled Bellman Flows for Distributional Reinforcement Learning
-
This paper explicitly weaves the affine transport geometry of the distributional Bellman equation into the flow matching path: it uses a shared base noise to simultaneously drive the paths of the current and successor states, and leverages a \(\lambda\) control variate to trade off bias and variance, resulting in a distributional critic that is source-consistent, Bellman endpoint-consistent, and stable.
- Probing RLVR Training Instability through the Lens of Objective-Level Hacking
-
The authors propose the "objective-level hacking" framework, attributing the increasing training-inference discrepancy in MoE large models under RLVR to biased pseudo-signals introduced by token-level weight distortion in the optimization objective. Through four sets of experiments on a 30B MoE, they verify that bias (not variance) is the root cause.
- QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
-
QHyer replaces the trajectory-dependent RTG in Decision Transformer with state-dependent Q-values estimated by Normalizing Flows, and stacks a gated Hybrid Attention-Mamba backbone to achieve content-adaptive historical compression. It sets new SOTA on both non-Markovian and Markovian offline goal-conditioned RL datasets (OGBench/D4RL).
- R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning
-
R2R2 incorporates VICReg-style redundancy reduction constraints into self-predictive learning (SPL) to stabilize high UTD training, with the key modification being the omission of zero-centering—theoretically proving that zero-centering removes the constant eigenmode (i.e., global dynamics information) in the spectral decomposition of SPL. Experiments show that on TD7 with UTD=20, the score increases from 1.02 to 1.24 (+22%), and the newly proposed SimbaV2-SPL architecture achieves new SOTA in continuous control.
- Recovering Hidden Reward in Diffusion-Based Policies
-
EnergyFlow explicitly parameterizes the score field of diffusion policy as the negative gradient of a scalar energy function, and proves that under maximum-entropy optimality, the score equals the gradient of the soft Q-function. This provides a scalar signal usable as a downstream RL shaping reward "for free" without adversarial optimization, while the conservative field constraint improves OOD generalization.
- SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning
-
This paper formalizes the plasticity loss of MoE policies in continual reinforcement learning as a decline in the spectral-entropy effective rank of the empirical NTK matrix, reduces it via Gauss-Newton and Kronecker decomposition to a computable proxy that depends only on the "expert feature Gram matrix," and finally uses a one-line Parseval penalty (SPHERE) to increase this proxy. On MetaWorld and HumanoidBench continual RL settings, task success rates improve by 133% and 50%, respectively.
- Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning
-
This paper proposes the Reach-Avoid Probability Certificate (RAPC), which uses a max-min-clamped Bellman contraction operator to lower-bound the value function by the reach-avoid probability. It introduces an adversarial \(\gamma^T\)-decay "compensation factor" for normalization, and employs symmetric gradient projection to jointly optimize the conflicting objectives of "cost" and "reach-avoid probability." On MuJoCo, it achieves both lower cumulative cost and higher reach success rate than RC-PPO / RESPO / CPPO.
- T\(^2\)PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
-
T\(^2\)PO attributes training collapse in multi-turn agentic RL to "hesitation"—overthinking at the token level and repeated ineffective actions at the turn level. It introduces a self-calibrated uncertainty signal \(M_t\) (combining entropy and confidence) to simultaneously drive token-level Thinking Intervention (dynamic truncation of think segments) and turn-level Dynamical Sampling (resampling ineffective turns). This approach consistently outperforms PPO/GRPO/GiGPO on WebShop, ALFWorld, and Search QA, achieving stable training.
- Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning
-
This paper proposes FAN, which compresses "expensive generative policy + distributional critic" into "single-step flow anchoring + single noise-sample critic": Flow Anchoring completes behavior regularization within one flow evaluation, and the noise-conditioned critic replaces multi-sample quantile estimation with a single Gaussian noise sample. It achieves SOTA performance on D4RL/OGBench while training 5-14× faster than comparable distributional methods.
- Trajectory-Level Data Augmentation for Offline Reinforcement Learning
-
This paper proposes LIFT: in active localization tasks, it leverages the geometric properties of trajectories to "shortcut" redundant zig-zag paths left by suboptimal logging policies, synthesizes these transitions, and feeds them to a lightweight augmentor that replaces logging actions during data collection. This enables offline CQL to significantly outperform standard offline RL and warm-start SAC across low- to high-dimensional, partial observation, and other settings.
- Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments
-
This work reinterprets the reasoning "drift" among multiple MLLMs as negative sample constraints in DPO, using Plackett-Luce preference loss to simultaneously suppress the divergent trajectories of N source models. As a result, a 7B student model, without ground-truth reports and using only 10% of MIMIC-CXR, surpasses all source teachers in chest X-ray classification and report generation tasks.
- Vulnerable Agent Identification in Large-Scale Multi-Agent Reinforcement Learning
-
This paper investigates the bi-level NP-hard problem of "identifying the K most vulnerable agents in a large-scale MARL system with N agents." The problem is formulated as HAD-MFC (Hierarchical Adversarial Decentralized Mean Field Control). The authors use the Fenchel-Rockafellar transform to fold the lower-level worst-case adversarial policy training into a regularized "robust mean-field Bellman operator," and convert the upper-level combinatorial selection into an MDP with dense rewards, solvable via greedy or RL methods. They prove the decomposition preserves optimality and outperform baselines in 17 out of 18 tasks.
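The confidence gate described in the CAMEL entry above can be illustrated with a short sketch. It assumes the log-probabilities of the two candidate verdict tokens are already available; the function name `gated_preference`, the `threshold` value, and the `reflect` callback (standing in for the slower reflective generation) are hypothetical and not taken from the paper.

```python
from typing import Callable, Tuple


def gated_preference(
    logp_a: float,
    logp_b: float,
    threshold: float,
    reflect: Callable[[], str],
) -> Tuple[str, bool]:
    """Return (verdict, used_reflection).

    The absolute log-probability margin between the two verdict tokens
    serves as the confidence signal. The threshold is a free parameter here.
    """
    margin = abs(logp_a - logp_b)
    if margin >= threshold:
        # Confident: keep the single-token verdict and skip reflection.
        return ("A" if logp_a > logp_b else "B"), False
    # Not confident: fall back to the (more expensive) reflective judgment.
    return reflect(), True
```

For example, `gated_preference(-0.1, -3.0, 1.0, reflect)` keeps the fast verdict, whereas near-equal log-probabilities route the pair to `reflect`.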
🧮 Scientific Computing
- A Call to Lagrangian Action: Learning Population Mechanics from Temporal Snapshots
-
Starting from the principle of least action, this paper proposes the Wasserstein Lagrangian Mechanics (WLM) framework to learn second-order population dynamics instead of traditional first-order gradient flows. This enables the capture of richer collective phenomena such as periodicity and rotation, and allows interpolation and future prediction without requiring a reference process.
- ANTIC: Adaptive Neural Temporal In-situ Compressor
-
To enable "on-the-fly" compression of petabyte- to exabyte-scale PDE simulation data, this work proposes ANTIC: a physics-aware temporal selector retains only physically important snapshots, and a neural field continually fine-tuned with LoRA encodes residuals between adjacent snapshots. It achieves 435× compression on 2D Kolmogorov flow and 6807× spatiotemporal compression on a 4.2 TiB 3D binary black hole merger simulation.
- Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation
-
DoLQ inserts a "Scientist Agent" into the search loop of LLM symbolic regression, performing both qualitative (physical plausibility) and quantitative (ablation MSE contribution) evaluations on candidates. This approach forces LLM-SR, which typically produces "low-error but bloated and physically nonsensical" equations, to converge to equations that are both numerically accurate and structurally compact.
- Flow Sampling: Learning to Sample from Unnormalized Densities via Denoising Conditional Processes
-
This paper proposes Flow Sampling, which reverses flow matching/diffusion models from "data-driven" to "noise-driven"—constructing a denoising diffusion drift conditioned on source noise samples. On the interpolant, the detached model samples the energy gradient of \(X_1\) as the regression target, enabling the learning of efficient diffusion samplers in the absence of data, and naturally generalizing to constant curvature Riemannian manifolds.
- Mesh Field Theory: Port–Hamiltonian Formulation of Mesh-Based Physics
-
Starting from four physical principles—locality, permutation equivariance, orientation covariance, and energy conservation/dissipation inequality—this work proves that any mesh-based physical dynamics satisfying these axioms can be locally reduced to a port-Hamiltonian form at the Jacobian level. In this form, the conserved interconnection structure \(J\) is entirely determined by the mesh topology (signed incidence matrix \(D_k\)), while the metric and dissipation are learned through \(G\) and \(R\). The proposed MeshFT-Net achieves near-zero energy drift, correct dispersion and momentum over long rollouts, and significantly outperforms MGN and HNN.
- Meta-learning Structure-Preserving Dynamics
-
This work systematically introduces modulation-based meta-learning (a hyper-network maps a latent code \(\bm{z}^{(k)}\) to hierarchical modulation parameters) into Hamiltonian / GENERIC neural networks and proposes two novel modulations, latent multi-rank (MR) and latent SVD-like modulation. These enable a shared network to few-shot adapt to a family of new parameter instances without knowing the system parameters \(\bm{\mu}\), while strictly preserving the energy conservation/dissipative structure.
- MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier
-
MOOSE-Star decomposes the problem of "training an LLM to directly generate scientific hypotheses"—which originally requires searching a \(\mathcal{O}(N^k)\) combinatorial space—into two sequential subtasks: "inspiration retrieval + hypothesis composition." By further stacking hierarchical tree retrieval, bounded composition, and motivation planning, the optimal complexity is reduced from exponential to \(\mathcal{O}(\log N)\). The authors also release the TOMATO-Star dataset with 108,717 decomposition-annotated papers.
- Phy-CoSF: Physics-Guided Continuous Spectral Fields Reconstruction and Super-Resolution for Snapshot Compressive Imaging
-
Phy-CoSF is a train-render two-phase deep unfolding framework for snapshot compressive spectral imaging (CASSI) that supports arbitrary wavelength querying. Each unfolding stage incorporates a continuous spectral field (CoSF) prior module, consisting of a Fourier-Mamba-driven triple-branch cross-domain feature mixer, random frequency encoding, and a spectral synthesis head. Training on discrete wavelengths enables inference at any continuous wavelength, achieving continuous spectral reconstruction and zero-shot spectral super-resolution.
- PODiff: Latent Diffusion in Proper Orthogonal Decomposition Space for Scientific Super-Resolution
-
PODiff moves diffusion models from pixel space to a fixed, variance-ordered POD coefficient space, enabling a tiny MLP to achieve pixel-level diffusion accuracy on \(640\times 480\) SST downscaling tasks. Since reconstruction is linear, ensemble variance can be analytically mapped back to physical space via \(\Sigma_u=\Phi\Sigma_a\Phi^\top\), yielding spatially interpretable and well-calibrated uncertainty.
- Rethink the Role of Neural Decoders in Quantum Error Correction
-
This work systematically re-implements five types of neural decoders—MLP, 3D-CNN, TCN, Transformer, and GNN—on surface codes with \(d\le9\), and incorporates "quantization + pruning + FPGA resource modeling" as first-class citizens in the training pipeline. The conclusion is: recent decoding performance is dominated by data volume rather than architectural complexity, and INT4 + QAT is a necessary prerequisite for achieving microsecond-level real-time decoding.
- Saving Foundation Flow-Matching Priors for Inverse Problems
-
To address the phenomenon where foundation flow-matching models such as Stable Diffusion / Flux perform significantly worse than domain-specific or even untrained priors on inverse problems, the authors propose FMPlug: a method that uses a sample-guided, time-learnable warm-start combined with a sharp Gaussian shell constraint to force the latent variables of the foundation FM back onto the thin shell where it was actually "trained," thereby significantly restoring its effectiveness as a prior for inverse problems.
- Semi-Supervised Neural Super-Resolution for Mesh-Based Simulations
-
SuperMeshNet employs two complementary MPNNs: the primary model predicts LR→HR, while the auxiliary model predicts the HR-HR difference corresponding to LR-LR. These models generate pseudo-labels for unpaired HR samples through mutual supervision. Combined with lightweight inductive biases at the node/message levels, this approach enables PDE mesh super-resolution to surpass fully supervised baselines using only 10% HR data, consistently reducing RMSE across six MPNN architectures.
- Skipping the Zeros in Diffusion Models for Sparse Data Generation
-
SED transforms diffusion models from "full dense denoising across all dimensions" to "diffusion only on nonzero dimensions + autoregressive decoding of dimension-value pairs," reducing computation from linear in total dimensions to nearly constant in the number of nonzeros, while strictly preserving the semantic meaning of explicit zeros in scientific data.
- Smoothness Errors in Dynamics Models and How to Avoid Them
-
The authors theoretically identify that Kiani et al.'s "unitary GNN" overly constrains physical systems like heat diffusion, which naturally smooth over time, due to its strict preservation of the Rayleigh quotient. They propose "relaxed unitary convolution" (R-UniGraph / R-UniMesh), extending the Rayleigh quotient-unitary convolution framework from graphs to triangular meshes, achieving superior performance over strong baselines on MeshPDE and WeatherBench22.
- (Sparse) Attention to the Details: Preserving Spectral Fidelity in ML-based Weather Forecasting Models
-
MOSAIC addresses two types of spectral degradation in ML-based weather forecasting models—spectral damping from deterministic averaging and high-frequency aliasing from latent space compression—by combining probabilistic perturbation with mesh-aligned block-sparse attention on the HEALPix spherical mesh. With only 214M parameters at 1.5° resolution, it matches or surpasses models at 6× higher resolution, generating 24-member 10-day forecasts in 12 seconds on a single H100.
- Teaching Molecular Dynamics to a Non-Autoregressive Ionic Transport Predictor
-
This work treats expensive atomic trajectories as a "privileged auxiliary modality" during training. A bimodal trainer first learns dynamics from trajectories, then distills its hidden representations via closed-form ridge regression into a non-autoregressive predictor that only sees equilibrium structures. On lithium ion mean squared displacement prediction, it is 200× faster and more accurate than autoregressive SOTA.
- Topology-Preserving Neural Operator Learning via Hodge Decomposition
-
This paper proposes the Hodge Spectral Duality (HSD) neural operator, which decomposes the solution operator of manifold PDEs via Hodge orthogonal decomposition into a "low-frequency topological component (spectral basis) + high-frequency geometric component (FNO auxiliary grid)" dual-branch structure. A commutator correction term couples the two, enabling both high accuracy and conservation law fidelity on complex meshes.
- Unbiased and Second-Order-Free Training for High-Dimensional PDEs
-
This paper addresses the discretization bias in EM-BSDE training loss by proposing Un-EM-BSDE: single-step errors are averaged over two independent groups of Monte Carlo subsamples and then "multiplied" to form an unbiased estimator, eliminating bias without requiring Hessians. On benchmark PDEs such as HJB/BSB/AC, it matches the accuracy of Heun-BSDE / FS-PINNs but with only 1.79× the training time of EM-BSDE (compared to 42.91× for Heun-BSDE and 32.07× for FS-PINNs).
- WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation
-
WeatherSyn decomposes the workflow of meteorological forecasters' report writing into a multimodal instruction task of "image interpretation → key point listing → report drafting." It first constructs the WSInstruct dataset, covering 31 US cities and 8 weather aspects, and then applies a three-stage SFT→RFT→DPO fine-tuning process on Qwen3-VL-8B. This enables an 8B open-source model to consistently outperform closed-source large models such as GPT-5-Nano and Claude-3.7-Sonnet across various evaluation metrics, while also demonstrating zero-shot generalization to unseen cities.
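The covariance mapping \(\Sigma_u=\Phi\Sigma_a\Phi^\top\) quoted in the PODiff entry above holds for any linear reconstruction \(u = \bar{u} + \Phi a\), and a toy calculation makes it concrete. The sketch below uses plain Python lists; the matrices `phi` and `sigma_a` are made-up illustrative values (not real POD modes or ensemble statistics), and the helper names are assumptions for this example.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product, adequate for toy sizes."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def physical_covariance(phi: Matrix, sigma_a: Matrix) -> Matrix:
    """Sigma_u = Phi Sigma_a Phi^T for a linear reconstruction u = mean + Phi a."""
    return matmul(matmul(phi, sigma_a), transpose(phi))


# Toy example: 3 grid points, 2 modes (values invented for illustration).
phi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sigma_a = [[4.0, 0.0], [0.0, 1.0]]
sigma_u = physical_covariance(phi, sigma_a)

# The diagonal gives the per-grid-point variance in physical space.
pointwise_variance = [sigma_u[i][i] for i in range(len(sigma_u))]
```

In this toy case the third grid point mixes both modes, so its variance is the sum of the two coefficient variances.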
🔒 LLM Safety
- From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity
-
This work shifts LLM hallucination detection from "output probability analysis" to "loss landscape curvature": by injecting Gaussian noise into embeddings and measuring the perturbation in gradient direction and magnitude as a cheap proxy for the Hessian spectral radius, the method outperforms entropy, Semantic Entropy, EigenScore, and other baselines in AUROC across 12 model-dataset combinations.
- From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning
-
By tracking the cumulative parameter drift along "dangerous/safe directions" during LoRA fine-tuning, the authors reveal that the fundamental mechanism behind benign data breaking alignment is the monotonic drift of parameters toward the dangerous direction during fine-tuning. They propose SQSD—assigning each sample a continuous risk score based on the difference in single-step gradient projections along these two directions. SQSD maintains monotonic ASR ranking across 3 models × 2 datasets and generalizes across architectures, scales, and LoRA→Full transfer.
- Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping
-
This paper proposes ARS for hallucination detection in large reasoning models (LRMs): instead of perturbing the reasoning trace at the text level, it directly applies small perturbations to the latent representation at the end of the trace and continues decoding to obtain counterfactual answers. Using "answer agreement" as a label, a lightweight contrastive head is trained to shape the trace-conditioned answer embedding, enabling subsequent embedding-based detectors to better separate hallucinations from truthful answers (AUROC on TruthfulQA improves from 66.85 to 86.64).
- Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models
-
This paper targets the vulnerability of large reasoning models (LRMs) to "overthinking" triggered by logically deficient inputs. It proposes a hierarchical genetic algorithm (HGA) that, under pure black-box conditions, treats structurally decomposed questions as genes. Through sentence-level and question-level crossover and mutation (addition/deletion), it searches for adversarial samples with logical breaks. On MATH, it can amplify response length by up to 26.1×, enabling low-cost DoS attacks.
- Internalizing Safety Understanding in Large Reasoning Models via Verification
-
This paper argues that "being able to generate safe answers" ≠ "understanding safety," and proposes the SInternal framework: training large reasoning models solely to verify the safety of their own generated answers. The resulting emergent internal safety understanding significantly suppresses jailbreak attacks (StrongREJECT ASR drops from 41% to 0.6%) and provides a better starting point for subsequent RL.
- Jailbreaking Vision-Language Models Through the Visual Modality
-
The authors propose four attacks that jailbreak state-of-the-art VLMs solely via visual input (visual cipher / object replacement / text replacement / visual analogy riddles). Systematic evaluation on six advanced VLMs demonstrates that "safety alignment on the text side does not automatically transfer to the visual side," and mechanistic analysis reveals the underlying hierarchical mechanisms.
- Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models
-
This paper reveals a previously overlooked failure mode of Test-Time Scaling (TTS): by simply reducing the diversity of candidate responses, TTS becomes even more prone to outputting unsafe content than directly feeding adversarial prompts. The authors propose RefDiv, a genetic algorithm driven by dual signals—Shannon entropy and reference guidance—which efficiently jailbreaks across models, closed-source systems, and guardrails on both MCTS and Best-of-N.
- Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization
-
Metis reformulates multi-turn jailbreak as an inference-time policy optimization problem. Within an adversarial POMDP framework, the Attacker and Metacognitive Evaluator form a closed loop: dense analytical feedback from the Evaluator is used as a "semantic gradient" to guide the Attacker's belief update and policy improvement. This enables adaptation to 10 cutting-edge models (including O1 / GPT-5-chat / Claude-3.7) with an average ASR of 89.2%, while reducing token consumption by 8.2× compared to strong baselines, all without retraining any weights.
- MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety
-
MultiBreak employs an iterative framework of "active learning + uncertainty-guided rewriting" to expand a multi-turn jailbreak dataset to 10,389 dialogues and 2,665 unique harmful intents, achieving a diversity of 0.942 that far surpasses previous work. On DeepSeek-R1-7B / GPT-4.1-mini, it improves ASR by 54% / 34.6% over the next-best dataset.
- OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents
-
OTora introduces a novel attack paradigm, Reasoning-Level Denial-of-Service (R-DoS): without compromising task correctness, it employs a two-stage red teaming pipeline (first using insertion-aware optimization to induce the agent to proactively access attacker-controlled external resources, then deploying "reasoning-type payloads" optimized via ICL genetic search at those resources) to trap LLM agents in prolonged multi-turn overthinking states. On WebShop, Email, and OS agents, this achieves up to 10× reasoning token inflation and orders-of-magnitude latency attacks, with final task accuracy nearly unchanged.
- REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
-
REALISTA constructs an "input-dependent edit direction dictionary" in the LLM latent space, turning adversarial prompt optimization into a continuous problem under a simplex constraint. This approach preserves the semantic equivalence/coherence of discrete methods like SECA, while achieving the search flexibility of continuous methods like LARGO. It is the first to successfully induce hallucinations in free-form outputs of closed-source reasoning models such as GPT-5.
- SafeHarbor: Defining Precise Decision Boundaries via Hierarchical Memory-Augmented Guardrail for LLM Agent Safety
-
SafeHarbor upgrades LLM Agent safety from "static coarse-grained classifiers" to "dynamic hierarchical memory tree + dual-score gating." Through adversarial rule generation and entropy-driven self-evolution, GPT-4o maintains a 93%+ refusal rate while raising benign tool invocation success to 63.6%, significantly alleviating the over-refusal problem.
- Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks
-
This paper proves that all existing HFT defenses that impose constraints in parameter space can be circumvented due to parameter redundancy. It proposes Safety Bottleneck Regularization (SBR), which shifts the defense to the unembedding layer—a geometric bottleneck: by anchoring only the final hidden state of a single high-risk prompt, the Harmful Score can be suppressed to < 10 under 50 epochs of sustained HFT attack, without harming benign task accuracy.
- Self-Debias: Self-correcting for Debiasing Large Language Models
-
Self-Debias reframes LLM debiasing as "fair resource allocation of probability mass along the autoregressive reasoning chain": it uses trajectory-level suffix margins as resource units, applies the Jain fairness index to prevent resource collapse on easy samples, and combines cold-start SFT with consistency-filtered online self-training. With only 20k labeled seeds, it boosts Qwen3-8B's average score across 8 fairness/utility benchmarks from 77.5 to 81.7, and reverses the base model's "self-correction collapse" into a stable +0.4 improvement.
- Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance
-
This paper identifies two major sources of instability in existing GFlowNet red-teaming: high variance from partition function \(Z_\theta\) estimation, and mode collapse caused by noisy rewards from toxicity classifiers on OOD gibberish text. The authors propose three simple components—pairwise contrastive objective CTB to eliminate \(Z\), Noisy Gradient Pruning to filter uninformative pairs, and Min-K Fluency Stabilizer to block gibberish—which together boost the number of unique attacks on Qwen2.5-1.5B from 17 to 134 (about 7×), maintain a 92% ASR, and outperform baselines in cross-model/cross-defense transferability.
- STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack
-
This paper treats the entire denoising trajectory of T2I models as the "attack surface" for VLM red-teaming attacks. It proposes a hierarchical RL framework (STARE) combining a high-level prompt editor and low-level GRPO fine-tuning of rectified-flow models. This approach not only improves attack success rate by 68% over SOTA, but also reveals a novel phenomenon—Optimization-Induced Phase Alignment: adversarial optimization automatically binds "conceptual toxicity" to early denoising and "detail toxicity" to later stages, transforming the chaotic toxicity formation process into several predictable "vulnerability time windows."
- Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
-
This work uses Causal Tracing to show that "refusal" in LLMs is not a static vector at the terminal token, but a "refusal trajectory" spanning upstream intermediate layers and tokens. Based on this, the authors design SALO—a detector with <20M parameters, trained only on standard alignment data, yet able to leverage the irreversibility of Transformer causal masks to identify adversarial attacks such as GCG, AutoDAN, and Prefilling. SALO raises detection rates from 0% to over 85% on GCG/Prefilling attacks.
- Watermarking LLM Agent Trajectories (ACTHOOK)
-
ACTHOOK introduces the "software hook" concept into agent trajectories: at action boundaries, it inserts an extra action triggered by a secret key as a watermark. LLMs trained on such data will execute the hook at significantly higher frequency when prompted with the key, enabling copyright detection via black-box queries only. The average AUC reaches 94.3 with almost no impact on downstream task performance.
📚 Pretraining¶
- Annotations Mitigate Post-Training Mode Collapse
-
The authors observe that SFT aligns models to a low-entropy semantic prior, leading to an inverse scaling effect in which "the larger the instruction model, the more boring" its outputs become. They propose "annotation-anchored training": during pretraining, semantic tags are paired with documents; during SFT, loss on tag tokens is masked. At inference, the model first samples semantics, then generates responses, thereby narrowing the semantic diversity gap by 85% while retaining instruction-following ability.
- Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
-
This work systematically compares continuous diffusion, discrete masked diffusion, and looped transformer from the perspectives of expressiveness and trainability, proving that "continuous diffusion" is strictly more expressive than discrete diffusion and can simulate looped transformers, but its practical performance is limited by decoding and representation space. Based on this, it proposes CCDD (Coevolutionary Continuous Discrete Diffusion)—diffusing simultaneously in the discrete token space and the contextual embedding space of a pretrained LLM, with a single model jointly denoising. On LM1B/OWT, it reduces perplexity by 25-35% compared to MDLM, and surpasses MDLM's 256-step performance with only 8 sampling steps.
- CoFrGeNet: Continued Fraction Architectures for Language Generation
-
This work introduces the function class of "continued fractions," known for optimal rational approximation, into language generation Transformers. CoFrNet replacement modules (CAttnU/CAttnM/Cffn) are designed for multi-head attention and FFN, respectively. By leveraging the closed-form "continuants," \(d\) divisions are reduced to a single division. On GPT2-xl and Llama-3.2B, downstream performance is matched or even improved with only \(\frac{2}{3}\sim\frac{1}{2}\) of the parameters.
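The "single division" claim rests on a classical fact: a depth-\(d\) continued fraction \(a_0 + 1/(a_1 + 1/(a_2 + \dots))\) equals a ratio of two continuants that can be built with multiplications and additions only. The sketch below shows this on plain scalars with exact rationals; it is not the CoFrNet module itself, and the function names are hypothetical.

```python
from fractions import Fraction


def nested_eval(coeffs):
    """Evaluate a0 + 1/(a1 + 1/(a2 + ...)) bottom-up: one division per level."""
    value = Fraction(coeffs[-1])
    for a in reversed(coeffs[:-1]):
        value = a + 1 / value
    return value


def continuant_eval(coeffs):
    """Same value via the recurrences p_k = a_k p_{k-1} + p_{k-2} (and likewise q_k)."""
    p_prev, p = 1, coeffs[0]  # p_{-1}, p_0
    q_prev, q = 0, 1          # q_{-1}, q_0
    for a in coeffs[1:]:
        p_prev, p = p, a * p + p_prev
        q_prev, q = q, a * q + q_prev
    return Fraction(p, q)     # the only division happens here


coeffs = [1, 2, 2, 2]
print(nested_eval(coeffs), continuant_eval(coeffs))
```

Both routes give the same rational value; the second defers all division to the final step.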
- Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
-
This paper proposes Compute as Teacher (CaT): leveraging the \(G\) rollouts already sampled by GRPO to "synthesize" a pseudo-reference answer via a frozen anchor model, then, in unverifiable domains, using the model itself to derive binary rubrics from the pseudo-reference to score each rollout as RL reward. This approach directly transforms inference compute into supervision signals without any human annotation. On HealthBench, CaT achieves up to 30% improvement over baselines and matches or surpasses inference-time aggregation with 9× lower test-time compute.
- Consistent Diffusion Language Models
-
This paper points out that discrete diffusion lacks a continuous-domain probability-flow ODE counterpart, making direct consistency modeling infeasible. The authors propose using an exact closed-form posterior bridge as a "stochastic PF-ODE surrogate" in the discrete domain, constructing a Multi-Path Discrete Consistency (MPDC) training objective. This requires the denoiser's predictions to be consistent in expectation across multiple stochastic bridge paths, enabling single-stage, teacher-free training of Consistent Diffusion Language Models (CDLM) that can generate high-quality text in 2-3 steps. CDLM achieves SOTA in unconditional/conditional text generation and up to \(32\times\) speedup over AR models.
- Data Difficulty and the Generalization–Extrapolation Tradeoff in LLM Fine-Tuning
-
This work systematically investigates the role of data difficulty in SFT, finding that there is no "universally optimal difficulty." Instead, the optimal difficulty shifts toward harder data as data scale increases. This is explained via a trade-off between the "in-distribution generalization gap" and the "extrapolation gap," with a PAC-Bayes interpretation.
- Decomposing the Basic Abilities of Large Language Models: Mitigating Cross-Task Interference in Multi-Task Instruct-Tuning
-
This paper addresses cross-task gradient conflict in multi-task instruct-tuning by proposing Badit: first, SVD is used to decompose pretrained weights into a set of naturally orthogonal, high-singular-value LoRA "basic ability" experts; then, during training, spherical K-means dynamically groups the rank-1 components into orthogonal groups, shifting the traditional paradigm of "parameter isolation by task" to "decoupling by basic ability." On six LLMs, Badit achieves an average improvement of 2.68 ROUGE over GainLoRA.
- Edit-Based Refinement for Parallel Masked Diffusion Language Models
-
ME-DLM introduces a lightweight "decode-then-edit" stage to masked diffusion language models (e.g., LLaDA): the first stage performs standard unmasking to generate a draft, and the second stage applies parallel corrections using three types of token-level edits (replace/delete/insert). The supervision signal is derived from the shortest edit script (edit distance). With only 1/8 of the diffusion steps, it surpasses LLaDA-Instruct by +11.6 on HumanEval and +33.6 on GSM8K.
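A shortest edit script of the kind described above can be recovered with the textbook Levenshtein dynamic program plus a backtrace. The sketch below is a generic version over token lists, assuming unit cost for each of replace/delete/insert; the function names and the toy tokens are hypothetical, and the paper's actual label construction may differ.

```python
def shortest_edit_script(draft, target):
    """Return (distance, ops) turning draft into target with replace/delete/insert."""
    n, m = len(draft), len(target)
    # dist[i][j] = edit distance between draft[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if draft[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # keep or replace
                             dist[i - 1][j] + 1,         # delete from draft
                             dist[i][j - 1] + 1)         # insert from target
    # Walk back from the bottom-right corner to recover one optimal script.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and draft[i - 1] == target[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            ops.append(("keep", draft[i - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("replace", draft[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", draft[i - 1]))
            i -= 1
        else:
            ops.append(("insert", target[j - 1]))
            j -= 1
    ops.reverse()
    return dist[n][m], ops


def apply_script(ops):
    """Replay a script and return the resulting token list."""
    out = []
    for op in ops:
        if op[0] == "keep" or op[0] == "insert":
            out.append(op[1])
        elif op[0] == "replace":
            out.append(op[2])
        # "delete" emits nothing
    return out


draft = "a b d e".split()
target = "a c d e f".split()
distance, script = shortest_edit_script(draft, target)
print(distance, [op for op in script if op[0] != "keep"])
```

The non-"keep" entries of the script are the per-token edit labels a second-stage editor could be trained to predict.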
- Focus and Dilution: The Multi-stage Learning Process of Attention
-
In a simplified setting where a single-layer Transformer learns Markov data, this work analyzes gradient flow by performing stage-wise linearization around a sequence of critical points, rigorously characterizing the recurring "focus–dilution" cycles in attention training. Consistent phenomena are observed on WikiText and TinyStories.
- From Backward Spreading to Forward Replay: Revisiting Target Construction in LLM Parameter Editing
-
This paper systematically analyzes why backward spreading works in locate-then-edit (LTE) editing, why it is insufficient, and proposes forward replay: treating the first decisive layer as the optimization variable and obtaining subsequent layer targets via standard forward propagation. This approach consistently improves over MEMIT/RECT/PRUNE/AlphaEdit without extra computational cost.
- InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition
-
The authors propose InfoLaw: redefining "pretraining" as a process of "bucket-wise information accumulation," where the information in each bucket equals "quality density \(f_d\) × unique token count \(M_d\) × \(\log K\)" multiplied by an exponentially decaying factor with respect to repetition \(R_d\). The final validation loss is expressed as \(L = \alpha\cdot\text{info}^{-\beta}\), which can be fitted on 252M-1.2B and extrapolated to 7B / 425B tokens with an average error of 0.15% and a maximum of 0.96%. This formulation can be directly used to search for the optimal data recipe.
- Model Merging Scaling Laws in Large Language Models
-
The authors empirically establish, using 10,866 merged models, a dual-axis power law of the form \(L=L_*+BN^{-\beta}+A_0 N^{-\gamma}/(k+b)\): the base model size \(N\) determines the floor, the number of experts \(k\) determines the tail, and four mainstream merging methods (Average, TA, TIES, DARE) all share the same curve. This transforms the questions of "how many experts to merge" and "when to stop merging" into predictable, budgetable engineering problems.
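The fitted form \(L=L_*+BN^{-\beta}+A_0 N^{-\gamma}/(k+b)\) is easy to explore numerically. The sketch below evaluates it for a growing number of experts; every coefficient value is a made-up placeholder chosen only to show the shape of the curve, not a number from the paper.

```python
def merged_loss(N, k, L_star, B, beta, A0, gamma, b):
    """Loss predicted by the dual-axis power law for base size N and k merged experts."""
    return L_star + B * N ** (-beta) + A0 * N ** (-gamma) / (k + b)


# Hypothetical coefficients (illustrative only).
coef = dict(L_star=1.5, B=50.0, beta=0.3, A0=20.0, gamma=0.2, b=1.0)
N = 1e9

floor = coef["L_star"] + coef["B"] * N ** (-coef["beta"])  # limit as k grows
losses = [merged_loss(N, k, **coef) for k in range(1, 9)]
gains = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]

for k, loss in enumerate(losses, start=1):
    print(k, round(loss, 4))
```

Because the tail term decays like \(1/(k+b)\), each additional expert buys less than the previous one, which is what makes a stopping rule for merging budgetable.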
- On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
-
This paper uses a carefully controlled set of Sudoku/Rush Hour tasks—where "reasoning difficulty is fixed and only horizon length varies"—to systematically demonstrate that task horizon itself is an independent root cause of LLM agent RL training collapse. It proposes two horizon-reduction mechanisms, macro action and subgoal decomposition, which not only stabilize training but also enable strong zero-shot generalization to longer horizons (horizon generalization).
- Predicting Large Model Test Losses with a Noisy Quadratic System
-
This paper proposes the Noisy Quadratic System (NQS), a mechanistic loss model that expresses LLM test loss as \(L(N, B, K)\) (model size / batch size / update steps), explicitly modeling batch size in a scaling law for the first time. On Pythia + OWT2, it extends the range over which extrapolated predictions hold from Chinchilla's ~20× compute to ~4000× compute.
- Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models
-
The authors decompose traditional Softmax attention into two independent components: "non-negativization + L1 normalization," and demonstrate that the truly critical part is L1 normalization rather than the exponential. They replace the exponential with Softplus plus a dynamic length scaling factor to obtain LSSA, and then apply a power function-based "re-weighting" to sharpen the attention. The resulting LSSAR maintains nearly unchanged validation loss at 16× the training length and enables GPT-109M to "rediscover" Newton's law of universal gravitation from trajectory data.
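The decomposition above can be made concrete on a single row of attention scores: softmax is an exponential followed by L1 normalization, and swapping the exponential for Softplus keeps the weights non-negative and normalized. The sketch below is a simplified illustration; it omits the dynamic length scaling factor, and the `power` argument is a stand-in assumption for the paper's re-weighting step rather than its exact form.

```python
import math


def l1_normalize(weights):
    """Divide non-negative weights by their sum so they add up to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def softmax_weights(scores):
    """Softmax written as: exponential (non-negativization) + L1 normalization."""
    shift = max(scores)  # subtract the max for numerical stability
    return l1_normalize([math.exp(s - shift) for s in scores])


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_weights(scores, power=1.0):
    """Softplus replaces exp; power > 1 sharpens the distribution before normalizing."""
    return l1_normalize([softplus(s) ** power for s in scores])


scores = [2.0, 0.5, -1.0, -3.0]  # hypothetical query-key scores for one query
plain = softplus_weights(scores)
sharp = softplus_weights(scores, power=3.0)
print([round(w, 3) for w in softmax_weights(scores)])
print([round(w, 3) for w in plain])
print([round(w, 3) for w in sharp])
```

Raising the Softplus outputs to a power above 1 concentrates mass on the largest score, which is the sharpening role the re-weighting plays here.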
- Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm
-
The authors derive closed-form training dynamics on a simplified single-layer linear attention Transformer, proving that regularization methods can only alter convergence speed but cannot shift the convergence point (thus are almost doomed to fail in cFKA scenarios), while data replay can directly shift the convergence point and amplify oscillations to stabilize old knowledge. They further propose STOC, which prunes fragments based on token attention contribution and guides the pretrained model to generate replay corpora. STOC consistently suppresses forgetting better than LAMOL on synthetic + KnowEdit + IndustryCorpus legal corpora.
- Understanding Catastrophic Forgetting In LoRA via Mean-Field Attention Dynamics
-
The authors formulate Transformer self-attention as a mean-field particle system modeling token interactions, treat LoRA as a low-rank perturbation, and prove that forgetting is governed by two phase transition curves related to the "perturbation norm" and "network depth." They provide a long-term stability condition controlled by the eigenvalue gap of \(V\).
🏥 Medical Imaging¶
- Auditing Sybil: Explaining Deep Lung Cancer Risk Prediction Through Generative Interventional Attributions
-
This paper introduces S(H)NAP—a generative interventional framework based on 3D diffusion bridges for "removal + insertion"—which reverses the decision process of Sybil, a state-of-the-art lung cancer risk prediction model, into an LMPI (linear + second-order interaction model) comprising "nodule main effects + pairwise interactions + background." For the first time, it causally (rather than correlationally) audits Sybil’s reliance on in-hospital artifacts such as ECG electrodes and clothing metal fasteners, and reveals a severe "radial insensitivity" failure mode for peripheral lung nodules.
- EEG-Based Multimodal Learning via Hyperbolic Mixture-of-Curvature Experts
-
EEG-MoCE assigns each modality in EEG-based multimodal learning (emotion/sleep/cognition) a Lorentz manifold expert with learnable curvature, then uses curvature-aware attention for cross-modal fusion, where "higher curvature → richer hierarchy → higher fusion weight". On EAV/ISRUC/Cognitive datasets, cross-subject accuracy improves by +14.14%, +3.34%, and +7.98%, respectively.
- Evidential Reasoning Advances Interpretable Real-World Disease Screening
-
EviScreen employs a "normal + pathological" dual knowledge bank for region-level evidence retrieval, then performs evidential reasoning between the current case and retrieved evidence via cross-attention and self-attention. This provides both retrospective interpretability (which historical cases support the current decision) and localization interpretability (abnormality maps from contrastive retrieval). On four real-world external test sets, it achieves SOTA specificity at high recall.
- Federated Distillation for Whole Slide Image via Gaussian-Mixture Feature Alignment and Curriculum Integration
-
This paper proposes FedHD: in heterogeneous federated pathology scenarios, it performs "one-to-one" WSI feature-level distillation via Gaussian-mixture feature alignment, then gradually injects cross-institution synthetic features into local training through curriculum learning. This enables institutions to collaborate without sharing raw data or exchanging model parameters, and is compatible with heterogeneous MIL architectures and feature extractors. FedHD comprehensively outperforms existing federated and distillation baselines on TCGA-IDH / CAMELYON16 / CAMELYON17.
- From Holo Pockets to Electron Density: GPT-style Drug Design with Density
-
This work changes the conditioning signal for structure-based drug design from a "rigid empty pocket" to a "filler low-resolution electron cloud containing ligand and solvent," and proposes the first decoder-only autoregressive EDMolGPT. On DUD-E's 101 targets, it achieves a bioactive recovery of 41%, far surpassing previous ED-based methods.
- OT-Bridge Editor: Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport
-
OT-Bridge Editor reformulates "editing a vessel stenosis in coronary angiography" as a "constrained entropic OT problem in the vessel-structure composite domain," leveraging Schrödinger Bridge with path-level geometric projection supervision to achieve pixel-level controllable synthetic angiograms. On the ARCADE public dataset, it achieves a 27.8% relative improvement in downstream stenosis detection mAP@0.5.
- Layer-Specific Fine-Tuning for Improved Negation Handling in Medical Vision-Language Models
-
NAST uses causal tracing to compute the causal contribution (CTE) of each layer in the CLIP text encoder for negation understanding, then applies these CTEs for layer-wise gradient scaling in LoRA fine-tuning. This significantly enhances the semantic sensitivity of medical VLMs in distinguishing "presence/absence of symptoms," reducing the affirmative-negation accuracy gap from 21.6% to 4.2%.
- Learning the Interaction Prior for Protein-Protein Interaction Prediction: A Model-Agnostic Approach
-
L3-PPI transforms the biological "L3 rule" (protein pairs with more length-3 paths are more likely to interact) into a learnable graph prompt: a pretrained GNN recognizes L3 patterns, a gating network generates virtual L3 paths and regularizes their count according to PPI labels, forming a plug-and-play classification head that boosts any PPI representation model by 2–4 points on average.
- Marrying Generative Model of Healthcare Events with Digital Twin of Social Determinants of Health for Disease Reasoning
-
This paper proposes DiffDT: a conditional Latent Diffusion framework that connects electronic health records (ICD-coded event sequences) with multi-organ biomarker digital twins (tabular features derived from brain/heart/liver/kidney imaging and brain functional connectivity SPD matrices). The key innovation is an SPD-VQVAE based on Cholesky decomposition, which reduces \(\mathcal{O}(N^3)\) SPD manifold diffusion to a manifold-preserving and efficient latent space. An AR model then performs multi-pathway disease reasoning via the mediation path “generate digital twin → predict next ICD.” On UKB, next-event prediction AUC for 1944 diseases reaches 0.91, setting a new SOTA.
- MEG-XL: Data-Efficient Brain-to-Text via Long-Context Pre-Training
-
MEG-XL performs masked token pre-training on 2.5 minutes (191k tokens) of MEG context (5–300× longer than prior work), then fine-tunes on a 50-word brain-to-text task. With only 1 hour of data, it matches the decoding accuracy of SOTA supervised methods trained on 50 hours, and significantly outperforms all brain foundation models.
- Protein Circuit Tracing via Cross-layer Transcoders
-
The authors adapt the cross-layer transcoder from NLP to the protein language model ESM2, proposing the ProtoMech framework, which recovers 79% of downstream performance with less than 1% sparse latent circuits, and enables circuit-based steering to design high-fitness protein variants, outperforming baselines in over 70% of cases.
- Scaling Vision Transformers for Functional MRI with Flat Maps
-
By projecting 3D fMRI volumes into 2D "cortical flat maps" and feeding them as videos to a standard spacetime MAE-ViT, the authors train CortexMAE on 2.1K hours of HCP data: it dramatically outperforms SOTA in cognitive state decoding, validating flat maps as the "goldilocks zone" between voxel (volume) and region-averaged (parcellation) representations. They also release the first open-source fMRI foundation model benchmark Brainmarks, provide the first systematic scaling law for fMRI models, and report an honest null result: individual trait prediction still cannot beat a simple functional connectivity baseline.
- SIGMA: Structure-Invariant Generative Molecular Alignment for Chemical Language Models via Autoregressive Contrastive Learning
-
SIGMA enforces the alignment of hidden states for different SMILES permutations of the same molecule onto a unified trajectory using token-level contrastive loss, and introduces IsoBeam to prune isomorphic redundant paths during decoding, enabling sequence models to "think in chemical space by structure, not by string."
- SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task Alignment
-
SynerMedGen proposes the "generation-aligned understanding" principle—deriving understanding tasks directly from the same paired synthetic data (CTS / MI / TIA tasks). It first uses two-stage training to enable the understanding branch to learn representations beneficial for synthesis, then transfers these to the latent flow matching generation branch. On 22 medical synthesis tasks, it outperforms both dedicated synthesis models and existing unified MLLMs.
- TD3B: Transition-Directed Discrete Diffusion for Allosteric Binder Generation
-
TD3B frames the design of agonists/antagonists as a "directional transition operator" generation task, using a directional Oracle + affinity gating + tree search amortized fine-tuning within a masked discrete diffusion framework. This enables a pretrained peptide generator to produce peptide sequences that can specifically bias protein conformational transitions toward activation or inactivation.
- Towards A Generative Protein Evolution Machine with DPLM-Evo
-
This work proposes DPLM-Evo, extending the discrete diffusion in protein language models from "mask substitution only" to "explicit modeling of substitution + insertion + deletion evolutionary edits." By decoupling variable-length observed sequences into an upsampled-length latent alignment space plus a context-aware evolutionary noise kernel, it enables variable-length evolutionary generation and trajectory-based protein post-editing/optimization, achieving SOTA on ProteinGym single-sequence variant effect prediction.
- Towards Universal Gene Regulatory Network Inference: Unlocking Generalizable Regulatory Knowledge in Single-cell Foundation Models
-
This work identifies that single-cell foundation models (scFM) contain rich gene regulatory knowledge that is obscured by "reconstruction-based pretraining." It introduces two probes—Virtual Value Perturbation and Gradient Trajectory—to distill pairwise gene features from frozen scFM that generalize across genes and datasets. On the BEELINE benchmark, AUPRC is improved from ~0.5 to 0.8–0.97, inaugurating a new paradigm of "Universal GRN Inference (UGRN)."
🛡️ AI Safety¶
- ACTG-ARL: Differentially Private Conditional Text Generation with RL-Boosted Control
-
This paper proposes a hierarchical framework, ACTG, which decomposes private text generation into two subtasks: feature learning and conditional text generation. Furthermore, it introduces Anchored RL, which combines reinforcement learning objectives with optimal N-out-of-K SFT anchors, thereby improving the instruction-following ability of the conditional generator while maintaining text fidelity. On biomedical data, it achieves a 20% MAUVE improvement over prior work.
- Angel or Demon: Investigating the Plasticity Interventions' Impact on Backdoor Threats in Deep Reinforcement Learning
-
The authors systematically evaluate, for the first time, the impact of 7 mainstream plasticity interventions (SAM/Shrink&Perturb/Weight Clip/SN/WD/LN/ReDo) on deep reinforcement learning (DRL) backdoor attacks (14,664 experiments), finding that only SAM is a "demon"—significantly exacerbating backdoor threats. Based on this, they propose the "Sweeper-Converter-Connector" robust backdoor injection framework and provide a detection signal based on loss landscape sharpness.
- Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing
-
This work extends randomized smoothing (RS) from "supporting only single continuous or discrete input" to the hybrid perturbation setting of "discrete tokens + continuous images." Through a hybrid Neyman–Pearson analysis, it derives a one-dimensional, continuous, invertible likelihood ratio CDF, transforming the originally combinatorial discrete knapsack problem into a solvable root-finding problem. The first model-agnostic certificate for "joint image-text insecurity" is provided on LLaVA-Guard multimodal safety filtering.
- DP-KFC: Data-Free Preconditioning for Privacy-Preserving Deep Learning
-
This paper proposes DP-KFC: based on the observation that "the scaling of the Fisher matrix is determined by architecture, and its correlation structure can be approximated by modality-level spectral statistics," it uses structured synthetic noise (pink noise \(1/f^\alpha\) for images, Zipf sampling for text) to probe the network and reconstruct the KFAC preconditioner, without consuming privacy budget or introducing distribution shift. Under strong privacy (\(\varepsilon\le 3\)), it consistently outperforms DP-SGD and public data preconditioning methods.
- Dual-branch Robust Unlearnable Examples
-
This paper proposes DUNE: extending unlearnable example (UE) perturbations from a single spatial domain to joint "spatial + color" dual-domain optimization, aligning perturbation features to shift-induced labels and enhancing with pre-trained model ensembles. On CIFAR-10 / ImageNet, DUNE remains robust against 7 mainstream defenses (including ECLIPSE, ISS-J, COIN), reducing average test accuracy by 14.95%–50.82% compared to 12 SOTA UE methods.
- Fair Dataset Distillation via Cross-Group Barycenter Alignment
-
This work reveals that dataset distillation (DD) amplifies biases present in the original data—rooted in the interaction between "subgroup sample size imbalance" and "subgroup representational separation." The authors propose COBRA: using the barycenter of subgroup representations (independent of group size) as the distillation target, which simultaneously reduces EOD and improves accuracy across multiple DD frameworks.
- FedHPro: Federated Hyper-Prototype Learning via Gradient Matching
-
To address the issue in prototype-based federated learning where "directly averaging local prototypes inherits client bias," this work introduces a set of learnable global hyper-prototypes. These are optimized on the server via gradient matching to simulate prototypes as if trained in a centralized manner. Combined with client-side contrastive and alignment losses, this approach significantly improves accuracy under heterogeneous scenarios.
- Frequency Matching in Spiking Neural Networks for mmWave Sensing
-
From a "mechanism-data alignment" perspective, this work proves that LIF spiking neurons are equivalent to a first-order IIR low-pass filter, and proposes setting the membrane decay coefficient \(\beta\) according to the discriminative spectrum of mmWave signals. This enables SNNs to achieve an average of 6.22% higher accuracy and 3.64× lower theoretical energy consumption than ANNs on four standard mmWave datasets.
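The low-pass view can be checked on the subthreshold membrane update alone. The sketch below assumes the common discrete form \(v_t = \beta v_{t-1} + x_t\) and ignores thresholding and reset (a simplification made here, not a statement of the paper's full neuron model); its transfer function is \(H(z) = 1/(1 - \beta z^{-1})\), a first-order IIR low-pass filter.

```python
import cmath
import math


def membrane_trace(inputs, beta):
    """Subthreshold leaky integration v_t = beta * v_{t-1} + x_t (no spike, no reset)."""
    v, trace = 0.0, []
    for x in inputs:
        v = beta * v + x
        trace.append(v)
    return trace


def gain(omega, beta):
    """Magnitude of H(e^{j omega}) = 1 / (1 - beta * e^{-j omega})."""
    return abs(1.0 / (1.0 - beta * cmath.exp(-1j * omega)))


beta = 0.9
dc_gain = gain(0.0, beta)           # low frequencies are amplified
nyquist_gain = gain(math.pi, beta)  # high frequencies are attenuated
step = membrane_trace([1.0] * 500, beta)

print(round(dc_gain, 3), round(nyquist_gain, 3), round(step[-1], 3))
```

A larger \(\beta\) narrows the passband, so choosing \(\beta\) amounts to choosing which part of the input spectrum the neuron retains.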
- LAPRAS: Learning-Augmented PRivate Answering for Linear Query Streams
-
LAPRAS uses a predictor for "which queries will arrive" to split an online DP query stream into in-prediction and out-of-prediction categories. For in-prediction queries, it releases answers with low noise in one shot using the offline-optimal Matrix Mechanism. For out-of-prediction queries, it applies Smooth Allocation, estimating the total number of "unpredicted queries" online based on observed positions and allocating budget smoothly. When predictions are accurate, utility nearly matches the offline optimum; when predictions are poor, performance degrades gracefully to the online baseline.
- Limits of Convergence-Rate Control for Open-Weight Safety
-
The authors formalize "open-weight safety" as "how to delay the convergence speed of malicious fine-tuning," proving that the largest singular value of the Hessian spectrum is determined by the lower bound of the weight spectrum. Based on this, they design the SpecDef algorithm, which can strictly slow down first/second-order optimization. However, they also prove that any such convergence-rate control method can be circumvented by an attacker at the cost of a linear increase in model size.
- MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification
-
MetaMoE merges multiple client-specific experts, each fine-tuned on private data, into a deployable MoE model without sharing that private data. The core is to select "relevant and diverse" proxy samples from public data using relevance-weighted DPP, enabling proxy-aligned expert training followed by context-aware router training. This aligns expert behaviors with proxy supervision and significantly outperforms similarity-only proxy selection methods like FlexOlmo.
- Position: Embodied AI Requires a Privacy-Utility Trade-off
-
This position paper argues that privacy in embodied AI cannot be addressed by single-stage patches, but must be treated as an architectural, dynamic control signal spanning the entire lifecycle—across instruction, perception, planning, and interaction. The SPINE framework is proposed, leveraging an L1-L4 four-level privacy classification matrix to coordinate agent behavior at each stage.
- Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States
-
The authors provide the first convergent hidden-state DP upper bound for "differentially private zeroth-order optimization (DP-ZOGD)"—by designing a "directional + isotropic" hybrid noise mechanism and constructing an auxiliary process between two neighboring trajectories, they circumvent the technical barrier that zeroth-order updates lack global Lipschitzness. This reveals a previously unknown DP algorithmic principle: increasing the number of sampled directions per step \(K\) can actually reduce privacy loss.
- Scaling Unsupervised Multi-Source Federated Domain Adaptation through Group-Wise Discrepancy Minimization
-
Addressing the limitation that existing federated unsupervised multi-source domain adaptation (UMDA) methods can only handle 2–6 sources—becoming unstable or computationally infeasible as the number of sources increases—the authors propose GALA: all sources are randomly divided into small groups, and group-wise prediction distribution discrepancies are minimized (reducing \(O(N^2)\) pairwise alignment to linear complexity). Additionally, a centroid+temperature-based similarity weighting is stacked to select sources truly close to the target domain. On the newly constructed Digit-18 (18 sources) benchmark, the method converges stably and outperforms all baselines.
- The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents
-
This work constructs a programmatically generated "Synthetic Web" environment. By injecting a single high-credibility honeypot misinformation item at search rank 0, it causally demonstrates that cutting-edge LLM agents such as GPT-5 experience an accuracy drop from 65% to 18% under adversarial contamination at a 1-in-thousands rate. The models do not increase search effort and still answer with high confidence, revealing a deeply rooted "positional anchoring" failure mode.
- VPD-100K: Towards Generalizable and Fine-grained Visual Privacy Protection
-
The authors constructed a large-scale visual privacy dataset, VPD-100K, with 100,000 images, 33 fine-grained categories, and over 190,000 instances, covering four major domains (faces/on-screen PII/physical identifiers/location markers). They propose a three-part frequency-domain enhancement module (FDAF + Adaptive Spectral Gating + Frequency-domain Consistency Loss) inserted into the Neck of YOLOv10, boosting YOLOv10-L's AP on VPD-100K from 53.8 to 58.6 (+4.8), while maintaining stable real-time performance on live streams at 7.51ms latency.
🦾 LLM Agent¶
- A Minimal Agent for Automated Theorem Proving
-
This paper introduces AxProverBase, a minimal Lean 4 theorem-proving agent built from only three components ("compiler feedback + self-managed notebook + lightweight tool search"). Running on frontier, untuned LLMs (Claude Opus), it matches or surpasses specialized systems like Hilbert/Seed-Prover while reducing costs by 100x.
- Adaptive Querying with AI Persona Priors
-
The authors encapsulate the "distribution of LLM responses under persona conditions" as a finite mixture Bayesian prior, enabling efficient prediction of other responses for a user after only a few questions by performing closed-form posterior updates over persona, outperforming classic CAT/IRT baselines.
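With a finite mixture over personas, the posterior update is just Bayes' rule over a handful of components. The sketch below assumes each persona supplies a fixed answer distribution per question; the persona names, questions, and probabilities are all hypothetical, and the question-selection policy of the paper is not modeled.

```python
def posterior_over_personas(prior, personas, observed):
    """Closed-form update: posterior_k is proportional to prior_k * prod_q p_k(answer_q | q)."""
    unnorm = {}
    for name, weight in prior.items():
        likelihood = 1.0
        for question, answer in observed.items():
            likelihood *= personas[name][question][answer]
        unnorm[name] = weight * likelihood
    total = sum(unnorm.values())
    return {name: value / total for name, value in unnorm.items()}


def predict(posterior, personas, question, answer):
    """Posterior-predictive probability of an answer to a not-yet-asked question."""
    return sum(posterior[name] * personas[name][question][answer] for name in posterior)


# Hypothetical persona response tables (in the paper these come from an LLM).
personas = {
    "A": {"q1": {"yes": 0.9, "no": 0.1}, "q2": {"yes": 0.8, "no": 0.2}},
    "B": {"q1": {"yes": 0.2, "no": 0.8}, "q2": {"yes": 0.3, "no": 0.7}},
}
prior = {"A": 0.5, "B": 0.5}

post = posterior_over_personas(prior, personas, {"q1": "yes"})
p_q2_yes = predict(post, personas, "q2", "yes")
print(post, p_q2_yes)
```

One observed answer already shifts most of the mass to persona "A", which in turn changes the prediction for the unasked question.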
- Agent-Omit: Adaptive Context Omission for Efficient LLM Agents
-
By using Monte-Carlo rollout to quantify "which rounds of thought/observation can be omitted," and then training an 8B agent with cold-start SFT and dual-sampling omit-aware GRPO, the model adaptively skips redundant reasoning and observations. On five benchmarks, token usage drops significantly while accuracy matches seven leading models.
- AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction
-
The authors propose a new task, AWR, which aims to reconstruct an equivalent white-box workflow from a black-box agent system. They use MCTS to search the agent primitive sequence space, combined with a Red-Black pruning method based on dynamic score coloring to balance depth and breadth, achieving interpretable white-box reconstruction in five real-world domains.
- BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics
-
BioAgent Bench provides an end-to-end evaluation suite for "running bioinformatics pipelines with LLM agents"—10 real bioinformatics tasks × 10 frontier/open-weight models × 3 agent harnesses, combined with LLM judge scoring and three types of perturbation tests (corrupted/decoy/prompt-bloat). The study finds that frontier models can complete over 90% of pipelines, but robustness remains a concern.
- DiscoverLLM: From Executing Intents to Discovering Them
-
DiscoverLLM formalizes the scenario where "users themselves are unclear about what they want" as a progressive discovery process over a hierarchical intent tree. It uses a rewardable hierarchical user simulator to train models that actively diverge and explore when user intent is unclear, and converge to execution when intent is clear. On creative writing, technical writing, and SVG tasks, it outperforms baselines like CollabLLM by 10% in satisfaction and reduces dialogue length by 40%.
- EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle
-
EvolveR establishes a closed-loop lifecycle for LLM agents: "online interaction → offline self-distillation into a principle library → GRPO policy evolution." Instead of discarding past trajectories, the agent abstracts its own successes and failures into a retrievable set of "policy principles," then uses RL to learn how to leverage its own principles to solve new problems. On seven multi-hop QA benchmarks, it significantly outperforms RL agent baselines such as Search-R1.
- ExCyTIn-Bench: Evaluating LLM Agents on Cyber Threat Investigation
-
This paper introduces the first benchmark for evaluating LLM Agents in end-to-end "cyber threat investigation": ExCyTIn-Bench. From 57 real Azure tenant security log tables, it automatically generates 7,542 SQL QA tasks with evidence chains using an alert-entity bipartite graph, and provides a MySQL environment for agents to answer by querying logs and multi-hop evidence tracing. The current best model, Claude-Opus-4.5, achieves only a 0.606 reward.
- Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration
-
To address the persistent issues of "modality dominance" and "spurious modality coupling" in centralized multimodal fusion, GCL reframes multimodal learning as a protocolized collaboration among four agents in two stages: In the first stage, Routing/Auditing agents determine, on a per-sample basis, which cross-modal communications are permitted based on marginal predictive gain; in the second stage, Public-Factor/Aggregation agents decouple shared semantics from private specialization before aggregation. This approach achieves SOTA on MOSI/MOSEI/MIntRec.
- Internalizing Agency from Reflective Experience
-
This paper proposes the LEAFE framework, enabling LLM agents to generate "failure→rollback→correction→success" experience data by reflecting on failed trajectories, and then distilling feedback-grounded recovery ability via SFT. On long-horizon tasks such as CodeContests, WebShop, and ALFWorld, Pass@128 is improved by up to 14%, significantly outperforming outcome-driven RL methods like GRPO.
- Position: Agentic AI Orchestration Should Be Bayes-Consistent
-
This position paper advocates: stop trying to make LLMs themselves "Bayesian" (that path is both theoretically and practically infeasible), and instead move Bayesian structure to the orchestration control layer of agentic AI—let the controller maintain a belief over low-dimensional, task-level latent variables, update it via Bayes’ rule on "message observations" returned by agents/tools, and use expected utility or value-of-information for routing, stopping, escalation, and budget allocation.
- Position: Assistive Agents Need Accessibility Alignment
-
This is a position paper. Through a systematic review of 778 blind assistance task instances from 417 papers, the authors argue that "accessibility alignment" should be considered a primary alignment objective for agents, on par with helpful/harmless/honest, and propose a design pipeline covering four dimensions: goal, interaction, risk, and lifecycle.
- PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
-
PragLocker employs a two-stage strategy of "code-symbol initialization + noise injection under black-box target model feedback" to encode the agent system prompt into an obfuscated text that only works on the target LLM and fails on any other LLM. Thus, even if the prompt is stolen from the deployment side, attackers cannot reuse it on their own LLMs.
- ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards
-
ReSeek augments RL-trained search agents with a JUDGE action and uses BGE-reranker to compute an "ideal judgment" as a process reward, enabling the agent to softly "mask" irrelevant information and re-query after each retrieval. It also introduces FictionalHot, a contamination-resistant benchmark based on fictional entities. On Qwen2.5-7B, the average EM reaches 0.377, which is 3.1 points higher than ZeroSearch.
- Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining
-
Video2GUI employs a four-stage pipeline—coarse metadata filtering → fine-grained video quality filtering → Gemini-3-Pro for task/action extraction → high-resolution three-frame precise spatial grounding—to distill 500 million YouTube video metadata entries into WildGUI (12.7M trajectories, 124.5M screenshots, 1500+ applications), boosting Qwen2.5-VL/Mimo-VL by 5–20% on multiple GUI grounding and agent benchmarks.
- When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets (CAIA)
-
CAIA establishes the first "adversarial high-stakes" agent benchmark, evaluating 17 frontier models on 178 time-anchored real-world cryptocurrency tasks. Key findings: without tools, all models achieve only 12–28% accuracy (near random guessing); with tools, even the strongest model, GPT-5, reaches only 67.4%, versus 80% for human junior analysts. More critically, 55.5% of model tool calls prefer "unreliable web search" over authoritative on-chain data, and Pass@k metrics systematically mask this dangerous "trial-and-error luck" behavior.
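The closed-form persona update summarized in the "Adaptive Querying with AI Persona Priors" note above can be illustrated with a small sketch. It assumes a finite set of personas with prior weights and a per-persona table of answer probabilities for each question. The function names and the toy tables are hypothetical and are not taken from the paper.

```python
def posterior_over_personas(prior, likelihoods, observed):
    """Bayes update over a finite persona mixture (illustrative sketch).

    prior:       list of prior weights, one per persona (sums to 1)
    likelihoods: likelihoods[k][q][a] = P(answer a | persona k, question q)
    observed:    dict mapping question id -> the user's observed answer
    """
    weights = []
    for k, w in enumerate(prior):
        # Multiply the prior weight by the likelihood of every observed answer.
        for q, a in observed.items():
            w *= likelihoods[k][q][a]
        weights.append(w)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("observed answers have zero probability under every persona")
    return [w / total for w in weights]


def predict_answer(posterior, likelihoods, question):
    """Posterior predictive distribution for a question not yet asked."""
    answers = likelihoods[0][question].keys()
    return {
        a: sum(p * likelihoods[k][question][a] for k, p in enumerate(posterior))
        for a in answers
    }


# Toy data (made up): two personas, two yes/no questions.
PRIOR = [0.5, 0.5]
LIKELIHOODS = [
    {"q1": {"yes": 0.9, "no": 0.1}, "q2": {"yes": 0.8, "no": 0.2}},
    {"q1": {"yes": 0.2, "no": 0.8}, "q2": {"yes": 0.3, "no": 0.7}},
]

post = posterior_over_personas(PRIOR, LIKELIHOODS, {"q1": "yes"})
pred = predict_answer(post, LIKELIHOODS, "q2")
```

After observing a single answer to `q1`, the posterior shifts toward the first persona, and the prediction for `q2` is the posterior-weighted mixture of the two personas' answer tables. Choosing which question to ask next is not covered here.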
📐 Optimization & Theory
- Adaptive Estimation and Inference in Semi-parametric Heterogeneous Clustered Multitask Learning via Neyman Orthogonality
-
This work bridges double machine learning and clustered multitask learning, proposing an adaptive framework that combines Neyman orthogonality with data-driven pairwise fusion penalties. In a semiparametric setting with heterogeneous (possibly infinite-dimensional) noise, it accurately recovers latent task clusters, achieves oracle rates at the aggregation level, and establishes asymptotic normality for valid statistical inference.
- Budget-Feasible Mechanisms for Submodular Welfare Maximization in Procurement Auctions
-
This work is the first to provide a truthful mechanism with approximation guarantees for submodular social welfare maximization in procurement auctions with "budget constraint + private costs": BFM-SWM—a descending clock auction with geometrically increasing thresholds, single-point protection, and a price/payment rate parameter \(\beta\)—achieves non-negative surplus and budget feasibility, with a 0.0328-approximation for general submodular functions and 0.0877-approximation for monotone submodular functions. As a byproduct, BFM-VM improves the deterministic best approximation for value maximization from 1/64 to \(1/(12+4\sqrt{3})\approx 0.0528\), and reduces runtime from \(\mathcal{O}(n^2\log n)\) to \(\mathcal{O}(n\log n)\).
- FAB: A First-Order AB-based Gradient Algorithm for Distributed Bilevel Optimization over Time-Varying Directed Graphs
-
This paper proposes FAB—the first purely first-order algorithm for distributed bilevel optimization over time-varying directed graphs—combining AB/Push-Pull communication with value function penalization, providing a non-asymptotic \(\mathcal{O}(K^{-2/3})\) convergence rate, and incidentally resolving the long-standing open problem of AB/Push-Pull convergence rates in nonconvex, time-varying directed graph settings.
- Learning-Augmented Scalable Linear Assignment Problem Optimization via Neural Dual Warm-Starts
-
A lightweight network is trained to predict dual variables \(\hat{u}\) for the linear assignment problem (LAP). Using the Min-Trick, feasible duals \(\hat{v}\) are constructed and used as a warm start for the LAPJV exact solver, achieving over \(2\times\) end-to-end speedup on \(N=16{,}384\) instances while maintaining optimality.
- Learning Dynamics of Zeroth-Order Optimization: A Kernel Perspective
-
This paper uses empirical NTK as a unified perspective to prove that the eNTK induced by zeroth-order SGD is equivalent to projecting the first-order eNTK onto a random subspace spanned by perturbations. This, via the Johnson-Lindenstrauss lemma, explains why ZO methods still work on billion-parameter LLMs: the error depends only on output dimension \(V\) and perturbation count \(P\), and is independent of model dimension \(d\).
- Learning to Approximate Uniform Facility Location via Graph Neural Networks
-
This work designs an MPNN that "neuralizes" the classic approximation algorithm SimpleUniformFL for Uniform Facility Location, enabling end-to-end training with unsupervised expected cost loss, while also providing a provable \(\mathcal{O}(\log n)\) approximation guarantee (and \(\mathcal{O}(1)\) for the recursive version). Experiments show it outperforms the classic SimpleUniformFL algorithm and approaches ILP optimality.
- On the Convergence Rate of LoRA Gradient Descent
-
This work is the first to prove that original LoRA gradient descent achieves a minimum gradient norm convergence rate of \(O(1/\log T)\) without assuming bounded adapter matrices or requiring the reparameterized loss to be Lipschitz smooth (the classical \(O(1/T)\) rate is recovered if parameter norms are bounded). Based on this, theoretically motivated adaptive/normalized learning rates are designed and empirically shown to accelerate and stabilize training on logistic regression, ResNet-18, and TinyLlama.
- Probing Neural TSP Representations for Prescriptive Decision Support
-
The authors treat a trained TSP neural solver as a "transferable encoder," using frozen representations and lightweight probes to predict two types of expensive operations research sensitivity queries (node removal and edge forbidding). They systematically demonstrate that probe accuracy improves monotonically with solver quality and that ensembling with traditional heuristics achieves SOTA.
- RL-SPH: Learning to Achieve Feasible Solutions for Integer Linear Programs
-
This paper proposes RL-SPH, an end-to-end reinforcement learning heuristic that does not rely on external ILP solvers and can independently produce 100% feasible solutions. By leveraging "feasibility reward + two-phase policy + feasibility-aware neighborhood search," the Graph Transformer Agent reduces the average primal gap by a factor of 28.6 on ILPs with non-binary integer variables.
- RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
-
Based on the "row block diagonal dominance" structure of the Transformer layer Hessian, this work replaces the expensive Newton-Schulz orthogonalization in the Muon optimizer with a single row-wise \(\ell_2\) normalization, reducing per-step preconditioning complexity from \(\mathcal{O}(mn\min(m,n))\) to \(\mathcal{O}(mn)\). On GPT-2 / LLaMA pretraining, this yields a 13–44× wall-clock speedup, with perplexity not only maintained but slightly improved.
- Streaming Sliced Optimal Transport
-
Stream-SW is the first algorithm capable of estimating the sliced Wasserstein distance on "sample streams": for each 1D projection, it maintains an approximate quantile function using KLL/quantile sketch, turning the closed-form integral of 1D Wasserstein into a streamable estimator with logarithmic space complexity in the number of samples, thus enabling SOT in IoT/edge device scenarios where samples are "seen once and discarded".
- Support-Proximity Augmented Diffusion Estimation for Offline Black-Box Optimization
-
SPADE replaces the traditional regression surrogate with a conditional diffusion model to model \(p(y\mid\boldsymbol{x})\), and implicitly injects data priors into the surrogate via "mean/rank calibration" and "kNN support regularization (mean shrinkage + variance inflation)", enabling offline black-box optimization to stably achieve SOTA on Design-Bench and LLM data mixture tasks.
- Test time training enhances in-context learning of nonlinear functions
-
This work establishes the first rigorous generalization bound for the combination of a single-layer softmax-attention transformer and LoRA test-time fine-tuning, proving that on single-index polynomial tasks, TTT reduces the sample complexity of ICL from \(r^{\Theta(\mathrm{ie}(\sigma_*))}\) to \(r^{\Theta(\mathrm{ge}(\sigma_*))}\), allows the link function to vary per task, and enables inference error to approach the noise level as the context length increases.
- Transformed Latent Variable Multi-Output Gaussian Processes
-
This paper proposes T-LVMOGP: it transforms the core modeling challenge of multi-output GPs—constructing cross-output covariance \(k_{p,p'}(x, x')\)—into "computing inner products with a single scalar base kernel in a Lipschitz-regularized RCNN embedding space," and fully embeds this into the SVGP framework. For the first time, MOGP can scalably and expressively handle \(P > 10000\) outputs (including ZINB-likelihood spatial transcriptomics data), comprehensively outperforming baselines such as SV-LMC, OILMM, and GS-LVMOGP.
- Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization
-
This paper stores "stale" block cyclic coordinate descent gradient estimates in a FIFO buffer, reusing them with momentum decay, and proves this is equivalent to BCCD with warm-start. It also presents a counterintuitive result: a larger finite difference step size \(\epsilon\) implicitly smooths the loss landscape and reduces the effective Lipschitz constant, making stale gradients yield more stable descent.
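The dual warm-start idea in the linear assignment note above can be sketched in a few lines. The note does not define the "Min-Trick"; this sketch assumes it is the standard completion \(\hat{v}_j = \min_i (C_{ij} - \hat{u}_i)\), which makes any predicted row duals \(\hat{u}\) feasible for the LAP dual. The cost matrix, the stand-in prediction `U_HAT`, and all function names are hypothetical, and the brute-force solver only stands in for an exact solver such as LAPJV on a tiny instance.

```python
from itertools import permutations


def complete_duals(cost, u_hat):
    """Given any row duals u_hat, return column duals v with u_i + v_j <= C_ij."""
    n = len(cost)
    return [min(cost[i][j] - u_hat[i] for i in range(n)) for j in range(n)]


def dual_objective(u, v):
    """Value of the LAP dual; a lower bound on the optimal assignment cost."""
    return sum(u) + sum(v)


def brute_force_optimum(cost):
    """Exact optimum by enumeration (only sensible for tiny n)."""
    n = len(cost)
    return min(sum(cost[i][p[i]] for i in range(n)) for p in permutations(range(n)))


# Toy instance (made up). U_HAT plays the role of a network prediction.
COST = [
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
]
U_HAT = [1.0, 0.5, 1.5]

V_HAT = complete_duals(COST, U_HAT)
LOWER_BOUND = dual_objective(U_HAT, V_HAT)
OPTIMUM = brute_force_optimum(COST)
```

Because the completed pair is always dual-feasible, `LOWER_BOUND` can never exceed `OPTIMUM`, however poor the prediction is. A better prediction only tightens the bound, which is why a warm start of this kind does not compromise optimality of the exact solver.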
🤖 Robotics & Embodied AI
- Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation
-
For zero-shot robotic manipulation from "training tasks to novel tasks," the authors decompose demonstrations into "atomic skill-action pairs" as an intermediate representation. They then use a dual-library approach (dynamic library retrieves by visual/planning similarity; static library uses IDF-weighted tokens to supplement missing skills) to provide the LLM with skill-comprehensive in-context demonstrations, thereby upgrading "trajectory imitation" to "compositional skill reasoning."
- Drift is a Sampling Error: SNR-Aware Power Distributions for Long-Horizon Robotic Planning
-
This paper proposes CAPS: reinterpreting "instruction drift" as a systematic sampling error, using SNR (\(=\log|\mathcal{A}|-\mathcal{H}\)) as a metacognitive switch. Only when entering high-entropy "Pivotal Windows" does it trigger Metropolis-Hastings iterative refinement based on power distributions \(\pi\propto p^\alpha\). On RoboTwin, Simpler-WindowX, and Libero-long, it surpasses OpenVLA and TACO in a training-free manner.
- Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models
-
This paper reformulates "vision-action attribution" as an intervention estimation problem, proposing two metrics: ISS (Interventional Saliency Score) and NMR (Nuisance Mass Ratio). By using Bernoulli masks + Gaussian blur perturbation + Action MSE as a proxy for KL divergence, it quantifies which visual regions VLA policies actually rely on. It is shown that NMR is strongly negatively correlated with OOD task success rate (\(r = -0.77\)), making it a cheap diagnostic tool for predicting VLA generalization.
- From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
-
MoLA employs a set of "modality-aware inverse dynamics models (IDM)" pre-trained on large-scale robotics data to translate future frames predicted by a video generation model into three discrete latent actions: semantic, depth, and optical flow. The policy head then controls based on these action-centric representations, achieving robust and accurate "imagination-to-execution" interfaces on CALVIN, LIBERO, LIBERO-Plus, and real UR5e robots.
- HDFlow: Hierarchical Diffusion-Flow Planning for Long-horizon Tasks
-
HDFlow uses a diffusion model to generate sparse strategic subgoals and a rectified flow to generate dense trajectories, further incorporating energy guidance and manifold projection. This constructs a two-layer planner with a division of labor between slow and fast modules, boosting the success rate of long-horizon, sparse-reward tasks such as furniture assembly by 20–30 percentage points.
- Latent Reasoning VLA: Latent Thinking and Prediction for Vision-Language-Action Models
-
LaRA-VLA internalizes both textual and visual CoT in VLA models as continuous latents. Through a three-stage curriculum training (explicit CoT → latent replacement → action expert adaptation), reasoning is completed in the latent space. Compared to explicit CoT, inference latency is reduced by up to 90%, restoring control frequency to real-time levels.
- Mitigating Error Accumulation in Continuous Navigation via Memory-Augmented Kalman Filtering
-
This work reformulates step-by-step prediction in continuous UAV VLN as a closed loop of "recursive Bayesian estimation = GRU prior + memory bank likelihood + learnable Kalman gain." On TravelUAV, fine-tuning with only 10% of the data boosts L1-Full SR from 17.6% to 25.9%, while position drift after 100 steps is flattened to 30–40 meters.
- Optimal and Scalable MAPF via Multi-Marginal Optimal Transport and Schrödinger Bridges
-
This paper proves that anonymous multi-agent path finding (MAPF) can be formulated as a Markovian Multi-Marginal Optimal Transport (MMOT) problem, compressing the original \(K^{T+1}\)-dimensional transport tensor into a polynomial-size LP (P1), with total unimodularity guaranteeing integer optimality. It then generalizes to the Schrödinger bridge, yielding a Sinkhorn-style entropic relaxation (P2) that produces a "shadow transport." Finally, pruning and solving an LP (P3) on the shadow recovers integer solutions, achieving 3.6×–7.1× speedup and <10% cost gap at \(K^{1.15}\) complexity.
- Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation
-
This paper proposes SAGE: automatically synthesizing large-scale navigation tasks and IF-THEN experience rules in a physics-constrained semantic sandbox, then distilling these experiences into a VLM policy using hybrid prompt sampling and asymmetric adaptive clipping GRPO. This approach boosts LLM-Match success rate on A-EQA from 43.5% to 53.2% (2B) / 60.2% (4B), and enables transfer to real indoor robots.
- Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation
-
This paper proposes PLMD: merging BEV semantic and obstacle maps into a Label Map, using DDPM to complete unexplored regions’ semantic + obstacle labels under obstacle priors, serving as a plug-and-play module for any GON policy. It consistently sets new SOTA on ON / IIN / MRON tasks across HM3D/MP3D.
- Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation
-
To address the issue of VLA (vision-language-action) models collapsing under minor perturbations, this work proposes a video transfer pipeline ("extract semantic/geometric conditions → rewrite caption → conditional video diffusion re-rendering") to inject visual and environmental diversity into simulation data. Additionally, a three-stage velocity caching scheme reduces generation time by 61%, and a difficulty + diversity-driven coreset sampling selects only 10% of key trajectories. Ultimately, on RoboTwin 2.0, LIBERO-Plus, and real robots, RDT-1B / \(\pi_0\) achieve 5–15% improvement.
- STEP: Warm-Started Visuomotor Policies with Spatiotemporal Consistency Prediction
-
STEP attaches a lightweight "previous action history + current observation → next action" Transformer predictor to a diffusion policy, using its output as a denoising warm-start. This compresses 100 denoising steps to just 2, and introduces an execution deadlock defense: if the action change is too small, a bit of noise is injected. Across 9 simulation and 2 real-world tasks, STEP outperforms BRIDGER / DDIM by 21.6% / 27.5% average success rate.
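The two quantities named in the CAPS note above, \(\mathrm{SNR}=\log|\mathcal{A}|-\mathcal{H}\) and the power distribution \(\pi\propto p^\alpha\), are simple enough to compute directly. The sketch below assumes a discrete action distribution given as a list of probabilities. The threshold rule in `is_pivotal` and all function names are hypothetical, and the Metropolis-Hastings refinement itself is not reproduced.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def snr(probs):
    """SNR = log|A| - H(p): 0 for a uniform policy, log|A| for a one-hot policy."""
    return math.log(len(probs)) - entropy(probs)


def power_distribution(probs, alpha):
    """Return pi with pi_i proportional to p_i ** alpha (alpha > 1 sharpens)."""
    powered = [p ** alpha for p in probs]
    z = sum(powered)
    return [x / z for x in powered]


def is_pivotal(probs, threshold):
    """Hypothetical switch: flag a high-entropy (low-SNR) step for refinement."""
    return snr(probs) < threshold


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.5, 0.3, 0.2]
sharpened = power_distribution(peaked, alpha=2.0)
```

With `alpha=2.0`, the probability mass on the most likely action increases and the rest shrinks, so sampling from `sharpened` is more conservative than sampling from `peaked`. Gating this on `is_pivotal` means the extra computation is spent only where the policy is uncertain.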
⚡ LLM Efficiency
- A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
-
This work establishes the first queueing model for LLM inference that explicitly incorporates the dynamic behavior of KV cache memory, deriving a closed-form stability condition \(\lambda < \mu(1-\delta)\), enabling operators to directly compute the required number of GPUs. Validation on single GPU, 8-GPU clusters, and LongBench real data shows prediction error within \(10\%\).
- Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving
-
This paper identifies that in multi-turn dialogue scenarios, the traditional Prefill-Decode (PD) disaggregation architecture is highly inefficient due to repeated P→D recomputation and KV transmission at every turn. It proposes the PPD (Prefill-capable Decode) dynamic routing system, allowing decode nodes to decide—based on SLO weights—whether to locally process Turn 2+ append-prefill. This reduces Turn 2+ TTFT by approximately 68%.
- OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration
-
OServe jointly models LLM serving’s “resource allocation + parallelism strategy + request routing” as a two-level max-flow problem on a flow network. It leverages LSTM-based workload prediction and ad hoc model switching via GPU interconnects to address real-world traffic heterogeneity in both spatial (different request types) and temporal (composition shifts over time) dimensions. Compared to vLLM, OServe achieves an average 1.5× and up to 2× improvement in end-to-end P99 latency and throughput.
- PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding
-
This paper proposes PipeSD: transforming speculative decoding from sequential cloud-edge execution to a token-batch pipeline, replacing fixed draft length with dual-threshold NAV triggering and Bayesian autotuning. On a real 5G cloud-edge testbed, PipeSD achieves 1.16×–2.16× speedup and 14–25% reduction in cloud energy consumption.
- Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models
-
This work proposes the Dilated Unmasking Scheduler (DUS): by using a "dilated, equidistant" predefined unmasking order that does not rely on model confidence, the number of denoiser calls per block of \(B\) tokens is reduced from \(\mathcal O(B)\) to \(\mathcal O(\log B)\). On LLaDA / Dream / DiffuCoder, this achieves a 5.8× wall-clock speedup with quality surpassing confidence-based parallel planners.
- Scout: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States
-
Scout reframes million-token long-text understanding as an "active information foraging" process, introducing a provenance-anchored, trajectory-decoupled epistemic state \(\mathcal{E}_t\) as the sole basis for reasoning. Through gap-diagnosed self-evaluation, it iteratively contracts to a query-sufficient subset. On LooGLE-v2 and \(\infty\)Bench, it matches or surpasses state-of-the-art models like Gemini-3-Pro, while reducing token cost to about \(1/8\).
- SLAY: Geometry-Aware Spherical Linearized Attention with Yat-Kernel
-
SLAY linearizes the Yat-kernel, inspired by the physical "inverse-square interaction," through four steps: (1) spherical normalization, (2) Laplace integral representation via Bernstein theorem, (3) Gauss-Laguerre quadrature, and (4) tensor product positive random features for polynomial+exponential kernels. This yields an \(O(L)\) attention mechanism nearly indistinguishable from softmax.
- Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving
-
This work provides the first theoretical framework for the emerging Attention-FFN Disaggregation (AFD) inference architecture. Based on a probabilistic workload model with "finite mean prefill length + decode length following a geometric distribution," it derives a closed-form solution for the optimal A/F ratio under the rA-1F topology: \(r^*=\max\{r_A, r_C, r_{\text{peak}}\}\). A trace-calibrated simulator verifies that the theoretical and empirical optima differ by less than 10%.
- Towards Resource-Efficient LLMs: End-to-End Energy Accounting of Distillation Pipelines
-
The authors built a staged GPU energy measurement framework based on NVML, decomposing the distillation pipeline into "teacher side + student side + evaluation" for segment-wise accounting. They found that one-off teacher logit caching/synthetic data generation dominates energy use, causing KD and synthetic SFT to consume about \(2.4\times\) more energy than direct SFT for 1B–13B OLMo-2 students. They provide a closed-form break-even formula, showing distillation is only truly "energy-saving" when teacher outputs are reused more than \(N^*\) times.
- Training-Inference Consistent Segmented Execution for Long-Context LLMs
-
This paper proposes a long-context LLM framework where training and inference share exactly the same segmented forward execution semantics: only a fixed-length differentiable KV tail is retained across segments, plus a forward-only retrieval bypass. On LLaMA2-7B 32K/80K, it achieves comparable or even better LongBench/RULER performance than full attention with about \(6\times\) lower prefill peak memory.
- Understand and Accelerate Memory Processing Pipeline for Large Language Model Inference
-
This paper unifies optimizations in modern LLM long-context inference—such as sparse attention, RAG, and compressed context memory—into a four-stage "Prepare Memory → Compute Relevancy → Retrieval → Apply to Inference" memory processing pipeline. It quantitatively demonstrates that this pipeline accounts for 22%-97% of total latency and that each stage exhibits highly heterogeneous computational characteristics. Based on this, a GPU-FPGA heterogeneous system is proposed: regular/compute-intensive operations remain on the GPU, while sparse/irregular/memory-intensive operations are offloaded to the FPGA. On MI210 + Alveo U55C, up to 2.2× end-to-end speedup and 4.7× energy reduction are achieved.
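The stability condition \(\lambda < \mu(1-\delta)\) from the queueing-theoretic note above lends itself to a back-of-the-envelope capacity check. The note does not define \(\delta\); this sketch assumes it is a fraction in \([0, 1)\) of service capacity lost to KV cache memory pressure, that \(\mu\) and \(\delta\) are per-GPU values, and that the total arrival rate is split evenly across identical GPUs. None of these assumptions or function names comes from the paper.

```python
import math


def is_stable(lam, mu, delta):
    """Check the stated condition lam < mu * (1 - delta) for one server."""
    return lam < mu * (1.0 - delta)


def min_gpus(total_lam, mu, delta):
    """Smallest GPU count for which each GPU's share of traffic is stable.

    Assumes identical GPUs and an even split of the total arrival rate.
    """
    capacity = mu * (1.0 - delta)
    if capacity <= 0.0:
        raise ValueError("effective capacity must be positive")
    # Start from a lower estimate and step up until the strict inequality holds.
    n = max(1, math.floor(total_lam / capacity))
    while not is_stable(total_lam / n, mu, delta):
        n += 1
    return n


# Made-up numbers: 10 requests/s total, 4 requests/s per GPU, 25% capacity loss.
gpus_needed = min_gpus(10.0, 4.0, 0.25)
```

Because the inequality is strict, a load that exactly equals the effective capacity still counts as unstable, so `min_gpus` rounds up in that boundary case.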
📈 Time Series
- CombinationTS: A Modular Framework for Understanding Time-Series Forecasting Models
-
CombinationTS decomposes time-series forecasting models into five orthogonal modules: Input Transformation / Embedding / Encoder / Decoder / Output Transformation. It performs paired Monte Carlo sampling over a shared "evaluation condition space," replacing fragile single-point MSE with marginal performance \(\mu\) and stability \(\sigma\). The main conclusion: with well-designed data views (Embedding), a parameter-free Identity Encoder can match or even outperform complex Transformers. Much of the "SOTA gain" in time-series forecasting stems from how data is viewed, not from modeling capacity.
- DAG: A Dual Correlation Network for Time Series Forecasting with Exogenous Variables
-
For time series forecasting with known future covariates (TSF-X), DAG designs a dual-pathway network: one pathway captures "historical exogenous → future exogenous" attention patterns along the temporal dimension and injects them into "historical endogenous → future endogenous" prediction; the other captures "historical exogenous → historical endogenous" patterns along the channel dimension and injects them into "future exogenous → future endogenous" prediction. On 12 public/new TSF-X datasets, DAG achieves the best MSE in 10/10 cases, significantly outperforming TimeXer, TFT, TiDE, CrossLinear, PatchTST, etc.
- Doubly Outlier-Robust Online Infinite Hidden Markov Model
-
This paper proposes BR-iHMM: combining "robust observation update (WoLF)" with "batched state inference (degenerate sticky HDP prior)" to provide bounded Posterior Influence Function (PIF) in both observation and state spaces for online infinite HMMs. On streaming data with outliers from financial order books, electricity load, and synthetic regression, one-step prediction RMSE is reduced by up to 67%.
- Ellipsoidal Time Series Forecasting
-
Fern reformulates long-term time series forecasting as "optimal transport from a fixed Gaussian source to a data-dependent ellipsoid," leveraging the Brenier theorem to restrict the search space to SPD (symmetric positive definite) class Jacobians. Using low-rank spectral decomposition via Householder reflections, the computational cost is reduced from \(O(n^3)\) to \(O(Rn)\). In non-stationary shock scenarios, Fern achieves up to 790× stability improvement over baselines like DLinear/Koopa.
- FRACTAL: State Space Model with Fractional Recurrent Architecture for Computational Temporal Analysis of Long Sequences
-
This work generalizes the probabilistic measure underlying the HiPPO framework to a fractional power-law measure with a tunable singularity index \(\alpha\), thereby, for the first time, achieving "full-history retention + recent sensitivity + scale invariance" simultaneously. This theory is instantiated as an LTI diagonalized SSM—FRACTAL matches S5 with an 87.11% average on Long Range Arena and achieves 61.85% on ListOps.
- From Observations to States: Latent Time Series Forecasting
-
The authors observe that even state-of-the-art TSF models with high prediction accuracy often exhibit "temporal disorder" (Latent Chaos) in their latent spaces. They propose LatentTSF: first, an AutoEncoder compresses observations into a high-dimensional latent state space; then, any mainstream backbone predicts future states in this space (using a Pred + Align dual loss); finally, the predictions are decoded back to the observation space. On six standard benchmarks, this approach consistently reduces MSE/MAE and restores temporal locality and spectral structure in the latent representations.
- HELIX: Hybrid Encoding with Learnable Identity and Cross-dimensional Synthesis for Time Series Imputation
-
A learnable "identity embedding" is assigned to each feature as a persistent semantic anchor, combined with a time-feature double helix attention mechanism. HELIX achieves first place across all 21 missing data scenarios on 5 public multivariate time series datasets, outperforming the next-best ImputeFormer by over 25% MAE reduction on datasets like ETT-h1.
- PATRA: Pattern-Aware Alignment and Balanced Reasoning for Time Series Question Answering
-
For time series question answering (TSQA), PATRA explicitly decomposes sequences into full/trend/season patterns at the representation level, and performs deep cross-alignment with text via three sets of learnable alignment tokens. In training, a two-stage SFT + GRPO reinforcement learning approach is used, mapping both discriminative and generative task rewards to \([0,2]\) to address difficulty imbalance, thereby comprehensively surpassing text LLMs, ChatTS, and other multimodal temporal LLMs across four TSQA tasks.
- Time-series Forecasting Through the Lens of Dynamics
-
The authors propose the PRO-DYN nomenclature using Allen's interval algebra, decomposing any time-series forecasting (TSF) model into "Pre-processing PRO → Dynamics DYN → Post-processing PRO" three stages. They discover two empirical rules: (i) DYN must be learnable and complete to outperform LTSF-Linear, (ii) DYN must be placed at the very end of the pipeline (PRE-DYN configuration) to fully leverage long lookback benefits. By adding a linear DYN layer to Informer/FEDformer/MICN/FiLM, performance consistently improves; moving DYN to the front in iTransformer/PatchTST/Crossformer degrades performance, experimentally validating both rules.
- TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models
-
TSRBench constructs a time series reasoning benchmark covering 14 domains, 4 major dimensions (perception/reasoning/prediction/decision-making), 15 tasks, 4125 questions, and supports four input modalities: text, visualization, text+image, and embedding. It systematically evaluates 30+ mainstream LLMs, VLMs, and TSLLMs, revealing key findings such as "scaling holds for perception/reasoning but fails for prediction" and "text and visualization modalities are highly complementary, but current models can hardly fuse them."
🕸️ Graph Learning¶
- Anchor-guided Hypergraph Condensation with Dual-level Discrimination
-
AHGCDD rewrites hypergraph condensation (HGC) from the decoupled paradigm of "train structure generator first, then match training trajectories" into an end-to-end framework: Heat-Kernel-PageRank injects structural information into initial features, an anchor-guided approach synthesizes sparse, learnable hyperedges based on feature distances, and a dual-level discrimination loss (class prototype MMD + instance-level contrastive) replaces expensive HNN retraining. On five hypergraph benchmarks, it matches or exceeds SOTA accuracy with up to 144× speedup.
- Full-Spectrum Graph Neural Network: Expressive and Scalable
-
This work generalizes the classical spectral GNN’s univariate eigenvalue filter \(g(\lambda_i)\) to a bivariate filter \(g(\lambda_i,\lambda_j)\), lifting the signal from the node domain to the node-pair domain. Theoretically, it can approximate Local 2-GNN (surpassing 1-WL), and avoids explicit \(n^2\times n^2\) computation via low-rank tensor decomposition, achieving strong results on node classification for heterophilic graphs and substructure counting.
- Information-Geometric Adaptive Sampling for Graph Diffusion
-
This work treats the sampling trajectory of the reverse SDE in graph diffusion as a parameterized curve on a Riemannian statistical manifold, deriving a training-free Drift Variation Score (DVS) from the Fisher-Rao metric to measure the local "information curvature" of the trajectory. Step sizes are adaptively scaled so that each step advances an equal length on the information manifold, achieving higher FCD / MMD fidelity with fewer steps in molecular (QM9/ZINC250k) and graph (Planar/SBM/Ego) generation.
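The "equal length per step" idea can be illustrated with a small sketch. It assumes a hypothetical positive function `speed(t)` standing in for the local information speed of the trajectory (the paper derives its own score, DVS, which is not reproduced here), and it places step boundaries so that every step covers the same integrated length. The grid resolution and the trapezoidal rule are illustrative choices, not the authors' procedure.

```python
def equal_length_schedule(speed, t0, t1, n_steps, n_grid=1000):
    """Return n_steps + 1 time points between t0 and t1 (t0 < t1) such that
    the integral of `speed` over every step is the same.
    `speed` is a hypothetical stand-in for a local information-speed score
    and must be strictly positive on [t0, t1]."""
    # Cumulative length on a fine uniform grid (trapezoidal rule).
    ts = [t0 + (t1 - t0) * i / n_grid for i in range(n_grid + 1)]
    cum = [0.0]
    for a, b in zip(ts, ts[1:]):
        cum.append(cum[-1] + 0.5 * (speed(a) + speed(b)) * (b - a))
    total = cum[-1]

    # Invert the cumulative length by linear interpolation.
    schedule, j = [t0], 0
    for k in range(1, n_steps):
        target = total * k / n_steps
        while cum[j + 1] < target:
            j += 1
        frac = (target - cum[j]) / (cum[j + 1] - cum[j])
        schedule.append(ts[j] + frac * (ts[j + 1] - ts[j]))
    schedule.append(t1)
    return schedule
```

With a constant `speed` the schedule is uniform; where `speed` is large the steps shrink, which matches the intuition of spending more steps on high-curvature parts of the trajectory.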
- Learning Graph Foundation Models on Riemannian Graph-of-Graphs
-
R-GFM treats "subgraphs with different hop counts" as nodes in a higher-level Graph-of-Graphs, and uses a dynamic MoE routing mechanism to assign each GoG to the Riemannian manifold (hyperbolic / Euclidean / spherical) with the best-matched curvature. This simultaneously addresses two inherent limitations of existing graph foundation models: fixed receptive fields and single Euclidean embedding. It achieves up to 49% relative improvement on downstream tasks.
- On the Expressive Power of GNNs to Solve Linear SDPs
-
Using the Weisfeiler–Leman hierarchy, this work is the first to characterize the minimal GNN expressiveness required to learn solutions to linear SDPs. It proves that standard variable-constraint bipartite message passing (VC-WL) and higher-order VC-2-WL are insufficient, while the VC-2-FWL architecture, equivalent to 2-FWL, is sufficient to simulate the update steps of the PDHG solver. On synthetic and SDPLIB datasets, using high-quality predictions as warm-starts yields up to 80% acceleration.
- Polynomial Neural Sheaf Diffusion: A Spectral Filtering Approach on Cellular Sheaves
-
PolyNSD replaces the "one-step spatial diffusion" in Sheaf Neural Networks with a learnable \(K\)-order polynomial spectral filter on the normalized sheaf Laplacian, computed stably via Chebyshev three-term recurrence. A single layer thus achieves \(K\)-hop receptive field and controllable low/band/high-pass response. An unexpected finding is that using only diagonal restriction maps outperforms all existing NSDs requiring dense high-dimensional stalks, with significant reductions in parameters, memory, and runtime.
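The Chebyshev three-term recurrence mentioned above can be sketched as follows. This is a minimal illustration on a small dense matrix, assuming a symmetric normalized Laplacian with spectrum in \([0, 2]\) that is shifted to \([-1, 1]\) by subtracting the identity. It uses an ordinary graph Laplacian rather than a sheaf Laplacian, and the function names and the fixed coefficient list are hypothetical (in the paper the coefficients are learned).

```python
def matvec(M, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def chebyshev_filter(L_norm, x, coeffs):
    """Compute sum_k coeffs[k] * T_k(L_tilde) x with L_tilde = L_norm - I,
    using T_0 = I, T_1 = L_tilde, T_{k+1} = 2 L_tilde T_k - T_{k-1}."""
    n = len(x)
    # Shift the spectrum from [0, 2] to [-1, 1].
    L_tilde = [[L_norm[i][j] - (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    t_prev = list(x)                          # T_0(L_tilde) x
    out = [coeffs[0] * v for v in t_prev]
    if len(coeffs) == 1:
        return out
    t_curr = matvec(L_tilde, x)               # T_1(L_tilde) x
    out = [o + coeffs[1] * v for o, v in zip(out, t_curr)]
    for c in coeffs[2:]:
        lt = matvec(L_tilde, t_curr)
        t_next = [2.0 * a - b for a, b in zip(lt, t_prev)]
        out = [o + c * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out
```

Each extra coefficient costs one more matrix-vector product, so a filter of order \(K\) reaches \(K\) hops without an eigendecomposition.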
- Quantile-Free Uncertainty Quantification in Graph Neural Networks
-
QpiGNN proposes a "quantile-free, post-hoc-free" GNN node-level prediction interval framework, using a dual-head GNN (one head predicts the mean, the other predicts the half-width) combined with a label-level joint loss that directly optimizes "coverage + interval width." Across 19 synthetic/real datasets, it achieves an average 22% improvement in coverage and a 50% reduction in interval width.
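To make the "coverage + interval width" trade-off concrete, here is a small sketch that evaluates intervals of the form \([\mu - h, \mu + h]\) produced by a mean head and a half-width head. The `joint_objective` below (a hinge on the coverage shortfall plus a weighted width term, with hypothetical `target` and `lam` parameters) is an assumed stand-in for illustration only; it is not the paper's loss, and hard coverage as written here is not differentiable.

```python
def interval_metrics(mu, half_width, y):
    """Empirical coverage and mean width of intervals [mu - h, mu + h]."""
    covered = [abs(yi - m) <= h for m, h, yi in zip(mu, half_width, y)]
    coverage = sum(covered) / len(y)
    mean_width = sum(2.0 * h for h in half_width) / len(y)
    return coverage, mean_width

def joint_objective(mu, half_width, y, target=0.9, lam=0.1):
    """Hypothetical joint score: penalize falling short of the target
    coverage, plus a weighted penalty on interval width."""
    coverage, mean_width = interval_metrics(mu, half_width, y)
    return max(0.0, target - coverage) + lam * mean_width
```

Widening every interval raises coverage but also raises the width term, which is the tension a joint loss of this kind has to balance.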
- Structure-Centric Graph Foundation Model via Geometric Bases
-
SCGFM reframes cross-domain graph foundation modeling as a "triangulation" problem in metric measure spaces: it learns a set of \(K\) trainable geometric bases \(\{B_k\}\), represents each graph by the softmax of its Gromov–Wasserstein distances \(\delta_k\) to these bases to obtain a set of structural coordinates \(\mathbf{w}\), and uses the OT plan on the bases to aggregate node features into a unified dimension. This approach eliminates the traditional GFM constraint of "must align node feature spaces," and outperforms baselines in both in-domain and OOD few-shot graph/node classification.
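The step from distances to structural coordinates can be sketched in a few lines. The distances \(\delta_k\) are taken as given inputs (computing Gromov–Wasserstein distances is out of scope here). Applying the softmax to the negated, temperature-scaled distances, so that closer bases receive larger weights, is an assumption of this sketch, as is the `tau` parameter.

```python
import math

def structural_coordinates(deltas, tau=1.0):
    """Map distances to K bases into a weight vector w that sums to 1.
    Assumption: softmax over -delta / tau, so nearer bases weigh more."""
    logits = [-d / tau for d in deltas]
    m = max(logits)                       # subtract the max for stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

For example, `structural_coordinates([0.2, 1.0, 3.0])` puts most of the mass on the first basis, and lowering `tau` sharpens that preference.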
- Unsat Core Prediction through Polarity-Aware Representation Learning over Clause-Literal Hypergraphs
-
This work models CNF formulas as a "clause–literal hypergraph + clause association graph," decomposes variable representations into polarity-invariant and polarity-equivariant components at the variable level, and trains with polarity-flip consistency regularization, significantly boosting unsat-core variable prediction accuracy.
🔄 Self-Supervised Learning¶
- A Refined Generalization Analysis for Extreme Multi-class Supervised Contrastive Representation Learning
-
This paper improves the upper bound of sample complexity for supervised contrastive learning (where tuples are constructed from a finite labeled data pool). By introducing two different U-statistics estimators, it achieves a breakthrough in the extreme multi-class setting: moving from bounds dependent on the minimum class probability to those dependent only on the number of classes or sample size.
- Beyond Distribution Estimation: Simplex Anchored Structural Inference Towards Universal Semi-Supervised Learning
-
This paper proposes SAGE, which replaces "estimating the distribution of unlabeled data" with "structural inference in representation space." By combining simplex ETF geometric anchors, high-order graph propagation, and distribution-agnostic reliability weighting, it achieves an average 8.52% accuracy improvement under the UniSSL setting with extremely scarce labels and arbitrary unlabeled distributions.
- Data Augmentation of Contrastive Learning is Estimating Positive-incentive Noise
-
The authors prove that "predefined data augmentations (rotation/cropping/flipping)" in contrastive learning are equivalent to point estimation of Positive-incentive Noise (π-noise). They then upgrade π-noise from "point estimation" to a learnable distribution by training a π-noise generator to add learnable noise to the original image as augmentation (PiNDA), leading to consistent improvements for SimCLR / BYOL / SimSiam / MoCo / DINO on vision tasks, and naturally adapting to non-vision data (HAR / Reuters / Epsilon) where manual augmentations are unavailable.
- How 'Neural' is a Neural Foundation Model?
-
The authors treat a "state-of-the-art foundation model (FNN) of mouse visual cortex" as a physiological experimental subject, analyzing its encoder, recurrent, and readout modules using the trio of decoding manifold, encoding manifold, and decoding trajectory. They find that FNN's fitting accuracy mainly relies on the readout's homogeneous feature maps, while only the recurrent module is truly "brain-like." Using a newly proposed tubularity metric, they quantitatively show that early encoding layers lack biological temporal structure, and provide clear recommendations for future neural foundation models: "add recurrence early, reduce feature dimensions in readout."
- Provable Accuracy Collapse in Embedding-Based Representations under Dimensionality Mismatch
-
The authors prove: For typical triplet tasks in contrastive learning, as long as the embedding dimension \(d\) is less than a constant multiple of the true dimension \(D\), the accuracy will "collapse" to the 50% baseline of a 1D random embedding, regardless of the optimizer used. Moreover, algorithmically, under the Unique Games Conjecture, this cannot be approximated in polynomial time.
- Statistical Consistency and Generalization of Contrastive Representation Learning
-
This work is the first to establish the Fisher/statistical consistency of contrastive representation learning (CRL), showing that "minimizing upstream contrastive loss is equivalent to optimal downstream AUC-type retrieval performance." It further provides sharp generalization bounds dependent on the number of positive samples \(n\) and negative samples \(m\): \(O(1/m+1/\sqrt n)\) (supervised) and \(O(1/\sqrt m+1/\sqrt n)\) (self-supervised). This theoretically explains, for the first time, why using tens of thousands of negatives in CLIP/SimCLR continues to improve performance.
- Text-Conditional JEPA for Learning Semantically Rich Visual Representations
-
This paper proposes TC-JEPA, which conditions the I-JEPA masked feature predictor additionally on image captions. By applying multi-layer sparse cross-attention, patch representations become predictable under textual "prompts," enabling the learning of semantically richer and dense prediction-friendly visual representations without contrastive loss.
- The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-modal Divergence
-
This paper elevates the InfoNCE loss to a deterministic "population energy" over representation distributions using a measure-theoretic framework, proving that the unimodal case is convex and converges to a unique Gibbs equilibrium, while the symmetric multimodal case exhibits persistent negative symmetric KL coupling, which geometrically and inevitably induces a modality gap.
- Understanding Self-Supervised Learning via Latent Distribution Matching
-
The authors unify contrastive, non-contrastive, and predictive SSL as "Latent Distribution Matching (LDM)": maximizing the log-likelihood of samples under an assumed latent model (alignment) + maximizing latent entropy (uniformity), and based on this, derive a nonlinear identifiable predictive SSL with a Kalman predictor.
🧊 3D Vision¶
- FSI2P: A Hierarchical Focus–Sweep Registration Network with Dynamically Allocated Depth
-
This paper abstracts the human observation process of "first scanning broadly, then examining in detail" into a two-stage Focus-Sweep paradigm. It replaces Transformer with Mamba for image-point cloud interaction and uses reinforcement learning to dynamically determine the number of interaction layers at each scale, achieving SOTA in I2P registration on RGB-D Scenes V2 and 7-Scenes.
- LabBuilder: Protocol-Grounded 3D Layout Generation for Interactable and Safe Laboratory
-
LabBuilder compiles free-text experimental descriptions into "asset-chemical protocol" pairs, then employs hierarchical generation, geometric/chemical multi-objective optimization, and navigation repair to produce 3D chemical laboratory layouts that are both visually plausible and executable by robots.
- Pair2Scene: Learning Local Object Relations for Procedural Scene Generation
-
Pair2Scene shifts 3D indoor scene generation from "directly fitting the global joint distribution" to "learning pairwise local object relations (support + function) and recursively assembling them via a scene hierarchy tree." With point cloud geometric encoding, Mixture-of-Logistics probabilistic heads, and collision-aware rejection sampling, it can generate complex scenes with object counts rising from about 4 to about 14 using only 3D-Front training data. Both FID and user studies outperform baselines such as ATISS, DiffuScene, and LayoutVLM.
- PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World
-
PhysForge reframes "creating interactive 3D objects" as a two-stage problem: "physical planning first, physical generation second." The VLM acts as a physical architect, generating a "Hierarchical Physical Blueprint" with hierarchical relationships, materials, and kinematic constraints. A diffusion model then uses KineVoxel Injection to co-denoise articulation parameters and geometric voxels. Combined with the PhysDB dataset (150k assets with four-layer annotations), this approach achieves the first single-view-to-"simulation-ready" 3D asset generation capable of grasping, pushing, and articulating in physics engines.
- PhysHanDI: Physics-Based Reconstruction of Hand-Deformable Object Interactions
-
This paper introduces PhysHanDI, which couples the MANO hand model with a Spring-Mass soft object model. Dense hand meshes drive the physical simulation of deformable objects, while object simulation refines hand reconstruction. The method achieves SOTA dense 3D reconstruction of both hands and soft objects from sparse-view RGB-D videos.
- R\(^3\)L: Reasoning 3D Layouts from Relative Spatial Relations
-
R³L attributes the two systematic errors in MLLM multi-hop "relative spatial relation" reasoning (semantic drift and metric drift) to "repeated reference frame transformations," and introduces three modules—Invariant Spatial Decomposition (shortening relation chains), Consistent Spatial Imagination (imagine-and-revise loop to eliminate conflicts), and Supportive Spatial Optimization (global-local pose reparameterization)—to enable GPT-5-generated open-vocabulary 3D scenes to achieve near-zero collision and out-of-bounds rates across 9 scene types, with semantic metrics significantly surpassing LayoutVLM/Holodeck/LayoutGPT.
- Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction
-
AmbiSuR explicitly models two types of intrinsic photometric ambiguities in Gaussian Splatting (primitive boundary spillover and under-constrained pixel blending) and resolves them using truncation and ray-color consistency. It further employs higher-order spherical harmonic coefficients as "self-indicators" to identify high-risk ambiguous primitives and applies amorphous local prior regularization. This reduces the average Chamfer distance on DTU to 0.46, surpassing the previous best GeoSVR (0.47).
- SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion
-
This work identifies that in multimodal point cloud completion, "hard projection of 3D points directly onto 2D grids" leads to a support set with Lebesgue measure zero and gradients truncated by Dirac delta (termed Cross-Modal Entropy Collapse). The authors replace hard projection with differentiable Gaussian Soft Splatting for continuous density estimation, and employ a hybrid encoder combining local EdgeConv and global Transformer, along with a global-local decoder. The method achieves SOTA on PCN/ShapeNet-55/34, and counter-factual evaluation on KITTI demonstrates that baselines actually degenerate into "unimodal template retrievers."
📊 LLM Evaluation¶
- CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers
-
This paper introduces CoCoReviewBench, which transforms human reviews of 3,900 ICLR/NeurIPS papers into a more reliable AI review evaluation reference through a two-step process: (1) constructing sub-benchmarks by category, and (2) filtering erroneous opinions by arbitrating reviewer/author conflicts using meta-reviews. The study finds that current AI reviewers still lag behind humans in correctness and thoroughness, while reasoning models show greater potential.
- Hallucinations Undermine Trust; Metacognition is a Way Forward
-
This position paper argues that "completely eliminating LLM hallucinations" is fundamentally limited by a "discrimination gap," which in turn imposes a utility tax. The authors advocate shifting the goal from "eliminating hallucinations" to faithful uncertainty, and view such metacognition as an indispensable control layer when agentic LLMs invoke tools.
- Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction
-
This work proposes "black-box environment interaction" as a new paradigm for evaluating integrated reasoning (abduction + deduction + induction) in LLMs, constructing the ORACLE benchmark with 96 environments across 6 task types. Benchmarking 19 LLMs reveals that even the strongest model, o3, achieves only 70% accuracy in simple environments and drops to 40% in difficult ones. All LLMs lack high-level planning abilities for "adaptive optimization of exploration strategies based on feedback."
- iWorld-Bench: A Benchmark for Interactive World Models with a Unified Action Generation Framework
-
iWorld-Bench is the first unified evaluation benchmark specifically designed for "interactive world models." It proposes an Action Generation Framework that maps three types of action inputs—text, one-hot, and camera intrinsics/extrinsics—into a unified command space. Based on 330K videos, it carefully selects 4.9K tasks and 9 metrics to comprehensively compare 14 mainstream models.
- Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
-
RACER formulates the problem of "deciding whether to invoke reasoning mode for each query in LLM-as-a-Judge" as a distributionally robust constrained optimization with a KL uncertainty set. It solves for the optimal routing policy under OOD conditions that still satisfies the cost budget using a primal-dual algorithm, and for the first time provides a linear convergence guarantee for LLM router policies.
- Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use
-
RHB constructs a suite of realistic multi-step tool-use tasks (both independent and chained modes, covering data pipeline, log forensics, performance optimization, and multi-file reconstruction) to quantify reward hacking behaviors in LLM agents. Across 13 frontier models, it is found that RL post-training significantly increases exploit rates (DeepSeek-V3 0.6% vs R1-Zero 13.9%). Exploit rates rise with chain length, and even models with near-zero rates "relapse" on harder variants. Lightweight environment hardening can reduce exploit rates by 87.7% without harming task success.
- Stop Automating Peer Review Without Rigorous Evaluation
-
This is a position paper: Through empirical measurement of real ICLR 2026 reviews and 60 simulated reviews, the authors identify two major failures in current LLM reviewing—hivemind (high convergence) and paper laundering (zero-shot rewriting can increase scores by 0.45). They argue that "LLMs should not directly generate review reports without rigorous evaluation" and call for the establishment of a "science of review automation."
- Token-Efficient Change Detection in LLM APIs
-
The authors prove that under low-temperature sampling, special inputs where "two token logits are nearly tied" (Border Inputs) are extremely sensitive to parameter perturbations—theoretically, SNR diverges as \(T\to 0\). Thus, by observing only output tokens (strict black-box), LLM API change detection can be performed with very few queries. The proposed B3IT matches gray-box logprob methods on the TinyChange benchmark at 1/30 the cost, and in 23 days of continuous monitoring across 93 commercial endpoints, it detected 8 real model replacements.
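A two-token toy model shows why nearly tied logits are so sensitive at low temperature. With only two candidate tokens, the probability of the first is a sigmoid of the logit gap divided by \(T\), so a tiny change in the gap is magnified by \(1/T\). The function names and the numbers in the usage note are illustrative assumptions, not the B3IT procedure.

```python
import math

def p_first_token(logit_a, logit_b, temperature):
    """P(sample token a) under a two-token softmax at the given temperature."""
    z = (logit_a - logit_b) / temperature
    if z >= 0:                              # numerically stable sigmoid
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def output_shift(gap, delta, temperature):
    """Change in P(token a) when a parameter perturbation moves the
    logit gap from `gap` to `gap + delta`."""
    before = p_first_token(gap, 0.0, temperature)
    after = p_first_token(gap + delta, 0.0, temperature)
    return abs(after - before)
```

With a gap of -0.005 and a perturbation of 0.01, `output_shift` is close to 1 at `temperature=0.001` but well under 0.01 at `temperature=1.0`; with a gap of 2.0 the same perturbation is essentially invisible at low temperature. That contrast is what makes border inputs useful probes when only output tokens are observable.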
📹 Video Understanding¶
- CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal
-
This paper proposes CLEAR for video subtitle removal: a two-stage training pipeline (Stage I uses a dual encoder + orthogonal decoupling to self-supervise a subtitle prior mask; Stage II adds LoRA + an occlusion head to the Wan2.1 video diffusion model for adaptive weighting). Inference requires no mask or text detector at all; with only 0.77% trainable parameters, PSNR reaches 26.80 dB on a Chinese test set (+6.77 dB over the strongest baseline), and zero-shot generalizes to six languages.
- Find, Fix, Reason: Context Repair for Video Reasoning
-
This work addresses the dilemma in video reasoning where "on-policy RL stagnates at a capability ceiling, while off-policy distillation leads to entropy collapse." It introduces a frozen, tool-integrated large teacher model that inserts minimal "evidence patches" (key-frame intervals, error types) when the student fails during rollout. The student then re-answers the same question, and the repaired trajectory is incorporated into GRPO optimization via a chosen-rollout mechanism.
- Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection
-
The authors propose OPL (Orthogonal Projection Layer) and its enhanced version G-OPL, which use a learnable orthogonal subspace derived from QR decomposition to explicitly project out "task-irrelevant variables" and "facial privacy components" from the feature space of video anomaly detection. Four privacy-aware metrics (SSC/ARD/PD/FPD) are introduced. While maintaining or improving VAD AUC, the accuracy of linear SVM probes for facial prediction drops significantly.
- RELO: Reinforcement Learning to Localize for Visual Object Tracking
-
RELO reframes the "where is the target" problem in visual single-object tracking as an MDP over a spatial feature map, treating each spatial location as an action. It replaces traditional handcrafted center heatmap supervision with actor-critic and direct IoU/AUC rewards, and introduces two stabilization designs: "warmup regression" and "layer-aligned temporal token propagation." On LaSOT\(_{\text{ext}}\), it achieves SOTA with 57.5% AUC.
- Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval
-
This paper addresses the query ambiguity and temporally sparse supervision in Partially Relevant Video Retrieval (PRVR) caused by "short queries vs. long videos." It proposes Holmes, a hierarchical evidential learning framework based on the Dirichlet distribution. Holmes distinguishes precise, polysemous, and under-determined queries across videos using a triple-principle and adaptively calibrates labels. Within videos, it achieves dense alignment via flexible optimal transport with a dustbin. The method achieves SOTA on ActivityNet, Charades, and TVR datasets.
- STORM: Segment, Track, and Object Re-Localization from a Single Image
-
STORM proposes a "single reference image" 6D pose tracking framework: hierarchical spatial fusion attention (HSFA) aligns reference-query features (producing segmentation masks + SAM3D mesh), then a BCE-trained Tracking Verifier outputs a logit whose negative is used as an energy score \(E=-g_\theta\). If the score exceeds a threshold for \(L=3\) consecutive frames, automatic re-localization is triggered. This pushes annotation-free 6D tracking accuracy on LM-O / YCB-V close to the ground-truth mask upper bound.
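The re-localization trigger described above reduces to a consecutive-frame counter over the energy score \(E=-g_\theta\). The sketch below assumes the verifier logits are already available as a list of floats. Resetting the counter after a trigger is an assumption of this sketch, and the function name and `threshold` value are hypothetical.

```python
def relocalization_frames(logits, threshold, patience=3):
    """Return frame indices at which re-localization would be triggered.
    A frame counts as suspicious when the energy E = -logit exceeds
    `threshold`; `patience` consecutive suspicious frames fire a trigger."""
    triggers = []
    streak = 0
    for t, g in enumerate(logits):
        energy = -g
        streak = streak + 1 if energy > threshold else 0
        if streak >= patience:
            triggers.append(t)
            streak = 0          # assumption: start counting afresh
    return triggers
```

Requiring several consecutive frames filters out single-frame dips in verifier confidence, at the cost of reacting a few frames late to a genuine tracking loss.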
- Unified Multimodal Visual Tracking with Dual Mixture-of-Experts
-
OneTrackerV2 unifies five tracking tasks—RGB, RGB+D, RGB+T, RGB+E, RGB+N—into a single network trained end-to-end. It uses a Meta Merger for modality fusion, and Dual MoE to explicitly decouple "spatiotemporal matching" and "modality fusion" into T-MoE and M-MoE, respectively. Dissimilarity loss and router clustering ensure these do not collapse into the same subspace.
- VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority
-
VideoSEAL identifies a prevalent "correct answer without seeing evidence" misalignment in existing agentic long video QA systems, attributing the root cause to "coupled agents conflating planning and answering authority." It proposes a planner-inspector decoupling framework: the planner is responsible for long-horizon evidence search, while the inspector holds exclusive answering authority and only permits answers when pixel-level evidence is sufficient. On LVBench, accuracy improves from 48.2% to 55.1% (↑20.5%), and on LongVideoBench from 52.2% to 62.0%.
🎵 Audio & Speech¶
- Alethia: A Foundational Encoder for Voice Deepfakes
-
Alethia introduces a dual-branch pretraining paradigm of "bottleneck-style masked embedding prediction + Flow-Matching spectrogram generation," training the first foundational encoder for voice deepfake detection, localization, and attribution. It significantly outperforms general-purpose SFMs like Wav2vec2, HuBERT, and WavLM across 5 tasks and 56 datasets, demonstrating strong zero-shot robustness to unseen singing deepfakes and real-world perturbations.
- MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks
-
MECAT constructs 20k multi-perspective fine-grained audio captions and 100k open-ended QA using a "multi-expert model + CoT large model reasoning" pipeline, and proposes the DATE metric (harmonic mean of semantic similarity × cross-sample discriminability), enabling, for the first time, stable distinction between generic and detail-accurate audio model outputs.
- MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio
-
MedMosaic constructs a medical audio QA benchmark (46,701 QA pairs, 10 question types) covering physiological sounds and real/synthetic clinical dialogues via a synthetic pipeline. It systematically evaluates 13 audio/multimodal models, revealing that even Gemini-2.5-Pro achieves only about 68.1% weighted accuracy, exposing fundamental limitations of contemporary LALMs in medical audio reasoning.
- MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models
-
MoshiRAG introduces a special ⟨ret⟩ trigger token into Moshi, a full-duplex speech model, enabling the model to asynchronously call an LLM/search backend for reference documents while speaking. By leveraging the natural "keyword delay" (the interval from speaking onset to keyword appearance), retrieval latency under 2 seconds is completely masked. This elevates the factuality of the speech model to the level of GPT-4o Audio on LlamaQ/WebQ/TriviaQA/HaluEval, while preserving full-duplex real-time interaction.
- NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating
-
A real-time salience detector inspired by cortical oscillations is implemented as a 2D oscillatory wave field (OWM), serving as a "training-free attention gate" for Audio Language Models (ALMs) on long audio. Only truly salient windows are fed into the ALM, boosting AP on XD-Violence from 53.5% to 70.6% while reducing ALM invocations by about 40%.
- Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration
-
Polyphonia extends zero-shot timbre transfer from single-track to dense multi-track mixtures: using the Ideal Ratio Mask (IRM) from blind source separation as an external acoustic prior, it performs "source interpolation + acoustic modulation" in the pre-softmax attention logits, enabling the target stem's (e.g., vocals) spectrum to be replaced by a new timbre (e.g., violin) while strictly preserving the background accompaniment. Compared to SOTA, it improves target alignment by 15.5%.
- Probing Cross-modal Information Hubs in Audio-Visual LLMs
-
By combining causal tracing and a unimodal-dominant framework, the authors reveal the existence of hidden "cross-modal sink tokens" in audio-visual LLMs, where the vast majority of cross-modal information is concentrated. Based on this, they propose a training-free attention amplification strategy that significantly alleviates object hallucinations.
⚖️ Alignment & RLHF¶
- BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking
-
BLOCK-EM uses SAE to identify a small set of internal latents that causally control emergent misalignment, then adds a one-sided regularizer during narrow-domain SFT to prevent the model from amplifying these latents in the misalignment direction—reducing emergent misalignment by an average of 93% across 6 fine-tuning domains, with almost no loss in in-domain task performance.
- \(f\)-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses
-
This paper establishes, for the first time, \(O(\log T)\) regret and \(O(1/T)\) suboptimality gap upper bounds for online RLHF under general \(f\)-divergence regularization. Two sampling strategies are proposed: (1) an optimism-in-face-of-uncertainty approach with a bonus term; (2) a novel "derivative-as-uncertainty" perspective—treating \(f'\) as an uncertainty signal, enabling derivative-based sampling without explicit confidence bound estimation each round.
- Pareto-Guided Optimal Transport for Multi-Reward Alignment
-
PG-OT shifts "multi-reward text-to-image alignment" from "weighted global summation" to "constructing a Pareto frontier for each prompt and using Sinkhorn optimal transport to move dominated samples to the frontier," introducing two new metrics, Joint Domination Rate / Joint Collapse Rate, to expose reward hacking masked by averaging. On Parti-Prompts, JDR₂ reaches 47.98%, an 11% improvement over strong baselines, with a human evaluation win rate close to 80%.
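The per-prompt Pareto frontier can be illustrated with a short dominance check. This sketch assumes each sample is scored by a tuple of rewards where higher is better. It covers only the frontier construction, not the Sinkhorn optimal transport step, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if reward vector a is at least as good as b on every reward
    and strictly better on at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(rewards):
    """Indices of samples that no other sample dominates."""
    return [i for i, r in enumerate(rewards)
            if not any(dominates(o, r)
                       for j, o in enumerate(rewards) if j != i)]
```

For rewards `[(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9)]` the frontier is samples 0, 1, and 3. Sample 2 is dominated by sample 1 even though a weighted sum with extreme weights could hide that, which is the kind of effect averaging can mask.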
- Reward Modeling from Natural Language Human Feedback
-
This paper identifies a severe "outcome-process inconsistency" (20–30%, up to 44%) in generative reward models (GRM) trained on binary preference rewards, where the model guesses the correct preference but provides an incorrect critique. The authors propose RM-NLHF: using the similarity between model and human critiques on core arguments as an additional process reward, and employing MetaRM to automatically predict process rewards and update them online with policy changes. This approach consistently outperforms outcome-only GRPO-trained SOTA GRMs across multiple benchmarks.
- The Realignment Problem: When Right becomes Wrong in LLMs
-
This paper formalizes the "what if the policy changes after model deployment" scenario as the Realignment problem, and proposes the TRACE framework: using a stronger proxy model to triage existing preference pairs into three categories (Invert / Punish / Retain), then performing surgical realignment with a hybrid IPO+NPO+KL objective, enabling adaptation to policy drift without a new round of human annotation.
- Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance
-
This paper proposes SVGT, which shifts value alignment from "embedding into backbone parameters/activations" to "attaching an independent value module." The module continuously assesses the safety direction of the current hidden state in an isolated value space, then uses a set of learnable Bridge Tokens as explicit attention anchors to guide generation. Across four backbones, harmfulness scores are reduced by over 70% with almost no loss in fluency.
- TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization
-
TUR-DPO augments DPO's preference logits with a "semantic + topological structure" shaping reward difference and an instance weight dynamically down-weighted by per-pair uncertainty. This allows the model to explicitly reward structural soundness of reasoning and suppress the impact of fragile preference pairs, while retaining the simplicity of RL-free training. As a result, TUR-DPO systematically outperforms DPO and IPO on reasoning tasks such as GSM8K / MATH / BBH / QA, and matches PPO on most tasks.
📄 Multi-Agent¶
- E-mem: Multi-Agent Based Episodic Context Reconstruction for LLM Agent Memory
-
E-mem shifts the traditional memory paradigm of "preprocessing and compressing into embeddings/graphs" to an episodic reconstruction paradigm of "retaining original context + in-situ reasoning by small model assistants": the master agent only performs global planning, while multiple SLM assistants each guard an uncompressed segment of the original text. Upon multi-pathway retrieval and activation, they conduct local reasoning and return evidence. On LoCoMo, E-mem surpasses SOTA by 7.75 F1 points while reducing token consumption by 70%.
- EngiAgent: Fully Connected Coordination of LLM Agents for Solving Open-ended Engineering Problems with Feasible Solutions
-
EngiAgent decomposes engineering problem solving into five expert agents: Analyzer, Modeler, Verifier, Solver, and Evaluator. A fully connected coordinator dynamically routes feedback (instead of following a fixed pipeline), boosting the feasible solution rate on engineering tasks with GPT-4o from 5.66% (zero-shot) / 7.55% (MM-Agent) to 64.15%, an average improvement of about 7× over previous SOTA.
- MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems
-
MASPO achieves end-to-end joint optimization of role prompts for the entire multi-agent chain without relying on annotations, by combining multi-granularity joint evaluation (local validity + lookahead potential + global alignment) and misalignment-case-driven evolutionary beam search. This yields an average improvement of about 2.9 points across 6 tasks.
- OMAC: A Holistic Optimization Framework for LLM-Based Multi-Agent Collaboration
-
This paper formalizes the optimization space of multi-agent systems (MAS) into five dimensions (two functional + three structural), and applies a dual-actor algorithm—"Semantic Initializer generation + Contrastive Comparator improvement"—to perform supervised optimization on each dimension. It then iteratively and jointly optimizes multiple dimensions, consistently outperforming baselines such as DyLAN, ADAS, and AFlow on HumanEval, MMLU, and MATH.
- RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation
-
RADAR models the communication topology design of multi-LLM-Agent systems as a "redundancy-aware" discrete graph diffusion process, using effective size as a guiding signal to incrementally generate query-adaptive collaboration graphs. It achieves higher accuracy, lower token consumption, and stronger robustness across six benchmarks.
- Systematic Failures in Collective Reasoning under Distributed Information in Multi-Agent LLMs
-
This paper adapts the social psychology Hidden Profile paradigm to multi-agent LLM evaluation, constructing a 65-task HiddenBench. Systematic evaluation on 15 cutting-edge LLMs reveals: for tasks where a single agent achieves 80.7% accuracy under Full Profile, multi-agent setups under distributed information achieve only 30.1%. The fundamental failure mode is the inability to proactively elicit information not disclosed by others. However, lightweight structured communication protocols can significantly mitigate this across model families.
🎬 Video Generation¶
- Attention Sparsity is Input-Stable: Training-Free Sparse Attention for Video Generation via Offline Sparsity Profiling and Online QK Co-Clustering
-
SVOO discovers that the attention sparsity of each layer in video DiT is an intrinsic property that is "input-invariant within layers and significantly heterogeneous across layers." Based on this, it first performs offline layer-wise sparsity calibration, then conducts online QK bidirectional co-clustering. Without retraining, it achieves up to 1.93× acceleration on 7 models including Wan/HunyuanVideo while maintaining PSNR at 29 dB.
- Exploring Data-Free LoRA Transferability for Video Diffusion Models
-
This work is the first to analyze the weight space of full fine-tuning (FFT) and LoRA for video diffusion models (VDM), finding that both "preserve the singular spectrum and only rotate the singular subspace," but their rotation directions conflict on head clusters. Based on this, the authors propose CASA, a data-free "cluster-wise spectral arbitration" LoRA transfer method that directly migrates LoRA trained on the base Wan2.1 to distilled variants like FastWan, without any user data or retraining.
- Lightning Unified Video Editing via In-Context Sparse Attention
-
To address the quadratic attention bottleneck in video editing under the In-Context Learning (ICL) paradigm, the authors design In-context Sparse Attention (ISA) based on two insights: "context tokens are significantly less salient than source tokens" and "query sharpness is proportional to Taylor approximation error." They train LIVEditor, which both accelerates inference by ~60% and surpasses SOTA full-attention models on multiple benchmarks.
- MiVE: Multiscale Vision-language features for reference-guided video Editing
-
MiVE simultaneously extracts the first and last layer hidden states from Qwen3-VL as multiscale condition tokens, concatenates them with VAE visual latents into a long sequence, and performs reference-guided video editing using unified self-attention in DiT. On a 60-clip 720P benchmark, it achieves the top human preference score and the best results on six VLM-based automatic metrics, outperforming open-source Wan-Animate and commercial Kling O1.
- Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization
-
QVG is a training- and finetuning-free KV-cache quantization framework for autoregressive video diffusion. It performs token smoothing via semantic-aware clustering and progressively compresses residuals in multiple stages. On LongCat-Video/HY-WorldPlay/Self-Forcing, it reduces KV memory to 1/7 of the original, with end-to-end latency overhead <4%. At 2 bits, it significantly outperforms LLM quantization baselines such as KIVI/QuaRot in quality.
- VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
-
VAnim models open-domain text-to-SVG animation as "sparse state updates on a persistent DOM tree" + "Identification-First motion planning" + "GRPO rendering-aware reinforcement learning," achieving a \(9.86\times\) sequence length compression while preserving topology, and significantly surpassing GPT-5.2, Gemini 3 Pro, and LiveSketch.
🔎 AIGC Detection¶
- Black-Box Detection of LLM-Generated Text Using Generalized Jensen-Shannon Divergence
-
SurpMark reformulates "AI text detection" as likelihood-free hypothesis testing: it computes token surprisal using a proxy LM, discretizes it into \(k\) states via k-means, estimates a first-order Markov transition matrix, and compares it with pre-built "human-written/machine-written" reference matrices using Generalized Jensen-Shannon Divergence (GJS). This approach provides black-box detection scores without retraining or per-instance resampling, requiring only a single forward pass.
- DGS-Net: Distillation-Guided Gradient Surgery for CLIP Fine-Tuning in AI-Generated Image Detection
-
This paper addresses the issue of "catastrophic forgetting of transferable priors when fine-tuning CLIP for AI-generated image detection" by proposing DGS-Net: the gradient of the classification loss is decomposed by coordinate into harmful positive components \(g^+\) and beneficial negative components \(g^-\). The image gradient of the training network is first orthogonally projected onto the complement space of the harmful direction of the frozen CLIP text gradient (Orthogonal Suppression, removing task-irrelevant semantics), and then further aligned to the beneficial direction of the frozen CLIP image gradient (Prior Alignment, preserving pre-trained priors). As a result, the average detection accuracy across 50 generative models surpasses SOTA by 6.6%.
- Feature-Augmented Transformers for Robust AI-Text Detection Across Domains and Generators
-
This paper systematically exposes the vulnerability of AI text detectors under cross-dataset/cross-generator shifts within a "single threshold fixed protocol" and proposes integrating learnable attention-weighted handcrafted linguistic features with transformer [CLS] representations. Using a DeBERTa-v3 backbone, the method achieves 85.9% balanced accuracy on the M4 multi-domain multi-generator benchmark, outperforming strong zero-shot baselines (Fast-DetectGPT, RADAR, Log-Rank) by up to +7.22.
- PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection
-
The authors introduce a 115k-sample DF-R5 dataset with reasoning annotations, replace CLIP ViT with ConvNeXT in the DX-LLaVA architecture, and propose PRPO—a paragraph-level GRPO variant. Each paragraph is rewarded based on CLIP text-image similarity (VCR) and majority-vote consistency between reasoning and conclusion (PCR). This approach boosts cross-domain deepfake detection F1 from SOTA 75.26% to 89.91%, and reasoning quality from 4.2/5 to 4.55/5.
💬 LLM / NLP¶
- A Geometric Relation of the Error Introduced by Sampling a Language Model's Output Distribution to its Internal State
-
This paper characterizes the information loss introduced by sampling from high-entropy distributions in GPT-style LLMs from a differential geometry perspective. It constructs \(\mathfrak{so}(n)\)-valued 1-forms and parallel transport operators, and demonstrates in chess probing experiments that such geometric rotations are highly aligned with the world vectors learned by the model.
- Escaping Mode Collapse in LLM Generation via Geometric Regulation
-
This work reinterprets "mode collapse" (repetition, looping, monotony) in LLM long-form generation from a dynamical systems perspective as "geometric collapse" of hidden state trajectories in representation space. It proposes RMR—a lightweight low-rank damping on the Transformer value cache to suppress the most persistent self-reinforcing directions, thereby maintaining stable, high-quality generation even in extremely low-entropy decoding regimes (0.8 nats/step).
- Top-W: Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for LLMs
-
Top-W formulates next-token truncation as a minimization problem with three terms: a geometry-aware Wasserstein term, an entropy term, and a mass term. The formulation explicitly accounts for token embedding geometry. Theoretically, the optimal solution is either a singleton token or a prefix of the tokens sorted by \(f(i)+\lambda\log p_i\), so the implementation reduces to an \(O(n\log n)\) scan. On GSM8K, GPQA, AlpacaEval, and MT-Bench, Top-W comes out ahead in the majority of the 15 (T, model) combinations and, at high temperatures, improves GSM8K by up to 33.7% over Top-H.
- Rethinking LLM Ensembling from the Perspective of Mixture Models
-
This paper proves that when performing token-level ensembling over \(n\) LLMs, it is unnecessary to run all models at each step—randomly sample one model according to the weights to generate the next token, and the output distribution is strictly equivalent to "average then sample." This reduces the \(n\)-fold forward pass back to a single forward pass, and, combined with "lazy KV cache synchronization," achieves a practical speedup of 1.78×–2.68×.
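The equivalence claimed in the ensembling note above can be checked on a toy example: sampling a model index by weight and then sampling a token from that model alone has the same marginal distribution as averaging all distributions first. The sketch below uses made-up next-token distributions and weights (`DISTS`, `WEIGHTS`) and hypothetical function names; it illustrates only the sampling identity, not the paper's KV cache synchronization.

```python
import random
from collections import Counter


def average_then_sample(dists, weights, rng):
    """Baseline: evaluate every model, mix the distributions, then sample."""
    vocab = len(dists[0])
    mixed = [sum(w * d[t] for w, d in zip(weights, dists)) for t in range(vocab)]
    return rng.choices(range(vocab), weights=mixed, k=1)[0]


def sample_model_then_token(dists, weights, rng):
    """Mixture view: pick one model by weight, then sample only from it."""
    k = rng.choices(range(len(dists)), weights=weights, k=1)[0]
    return rng.choices(range(len(dists[k])), weights=dists[k], k=1)[0]


def empirical(sampler, dists, weights, n, seed):
    """Empirical token frequencies of a sampler over n independent draws."""
    rng = random.Random(seed)
    counts = Counter(sampler(dists, weights, rng) for _ in range(n))
    return [counts[t] / n for t in range(len(dists[0]))]


# Toy next-token distributions for two models over a 3-token vocabulary.
DISTS = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
WEIGHTS = [0.25, 0.75]

# Exact ensemble distribution: sum_k w_k * p_k(token).
EXACT_MIXTURE = [sum(w * d[t] for w, d in zip(WEIGHTS, DISTS)) for t in range(3)]
```

Calling `empirical(sample_model_then_token, DISTS, WEIGHTS, 100_000, seed=0)` should give frequencies close to `EXACT_MIXTURE`, even though each draw queries a single model.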
✂️ Segmentation¶
- LightAVSeg: Lightweight Audio-Visual Segmentation
-
LightAVSeg decouples "semantic selection (what)" and "spatial localization (where)", replacing \(\mathcal{O}(N^2)\) cross-modal attention with global channel modulation. This enables the AVS model to achieve 50.4 mIoU (MS3) with only 20.5M parameters and 163.4 ms latency on Snapdragon 8 Elite, about \(8\times\) faster than AVSegFormer-R50.
- Segment Anything with Robust Uncertainty-Accuracy Correlation
-
To address the issue that the SAM series only outputs a single mask-level confidence and suffers from "Mask-level Confidence Confusion" under domain shift, this work equips SAM2 with a Weibull dual-granularity Bayesian mask decoder for pixel-level epistemic estimation. Inspired by human vision, a style + deformation collaborative adversarial perturbation and calibration loss are introduced, ensuring that uncertainty remains aligned with error across 23 zero-shot target domains. The average J&F reaches 79.87, and the uncertainty maps become significantly more reliable.
- SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation
-
SEMIR treats the voxel grid as a parent graph \(G\), compresses it into a "boundary-aligned" graph minor \(H\) (reducing node count from \(\sim10^7\) to \(\sim10^3\)) via parameterized edge contraction, node deletion, and edge deletion. Using only 5–20 few-shot samples, it black-box optimizes \(\Theta\) to maximize boundary Dice, then applies a GNN for supernode classification on the minor, and finally performs exact lifting via a bijection between the minor and the voxel grid. On the BraTS / KiTS / LiTS tumor segmentation tasks, SEMIR consistently outperforms nnU-Net on minority class Dice, requiring only a 16GB T4 GPU.
- UGround: Towards Unified Visual Grounding with Unrolled Transformers
-
UGround reverses the LMM-based visual grounding paradigm from "using the final layer \(\langle\text{SEG}\rangle\) token as prompt" to "using dynamically selected intermediate layer similarity maps as prompt." Through the RL-based SSC strategy, the \(\langle\text{SEG}\rangle\) token slides across all transformer layers, treating the similarity map as both a soft logit mask for SAM and a backward supervision signal. For the first time, it unifies five visual grounding tasks—RES, RS, FP-RES, gRES, Multi-RS—within a single framework, achieving cIoU +9.0% on ReasonSeg test and N-acc +12.1% on gRefCOCO val.
🔗 Causal Inference¶
- Causal Fine-Tuning under Latent Confounded Shift
-
This paper proposes Causal Fine-Tuning (CFT): embedding an SCM-inspired decomposition of "high-level stable feature \(C\) + low-level confounder-sensitive feature \(\Phi\)" into standard BERT fine-tuning, and using a front-door style do-calculus adjustment formula for prediction. CFT significantly outperforms SFT/SWA/WISE and other single-domain generalization baselines under spurious-correlation injection attacks on text.
- Controllable Generative Sandbox for Causal Inference
-
This paper proposes CausalMix: a variational generative framework that jointly optimizes a data-type-specific multi-head decoder and a Bayesian Gaussian mixture latent prior with three independently controllable causal "knobs" (overlap \(\alpha(X)\), CATE function \(\tau(X)\), unobserved confounding \(\kappa(X,T)\)). This enables users to freely design counterfactual benchmarks while maintaining fidelity to real data distributions. On real mCRPC (prostate cancer) cases, CausalMix demonstrates high-fidelity reproduction of mixed-type tabular data and stable, on-demand injection of overlap/confounding/heterogeneous effects, serving as a controllable stress test for CATE estimators.
- The (Marginal) Value of a Search Ad: An Online Causal Framework for Repeated Second-price Auctions
-
This paper models the true value of search ads as a treatment effect between “win” and “lose” outcomes. Under binary feedback in repeated second-price auctions (SPA), it designs an online causal learning algorithm that exploits the payment rule, achieving the minimax-optimal regret \(\widetilde\Theta(\sqrt{dT})\), which is strictly easier than learning in the corresponding first-price auction setting.
💻 Code Intelligence¶
- BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
-
BoostAPR introduces a three-stage pipeline for "training program-repair models with RL": execution-verified SFT → training both sequence-level and line-level reward models → during PPO, redistributing sequence rewards to key edit lines using the line-level model. On Qwen2.5-Coder-32B, it boosts SWE-bench Verified from 17.8% to 40.7% (+22.9pp), and achieves 24.8% on Defects4J via cross-lingual transfer.
- HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench
-
On SWE-bench, traditional PPL is affected by the "long context tax" and cannot predict post-SFT agent capabilities. This paper proposes the "entropy compression hypothesis" and the HE-SNR metric, which computes the signal-to-noise ratio only at "high-entropy decision points" where Top-10 entropy exceeds \((\ln 3 + \ln 4)/2\). This achieves a Pearson correlation of 0.96 and Kendall consistency of 0.98 with downstream SWE-bench scores.
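The HE-SNR threshold quoted above, \((\ln 3 + \ln 4)/2 \approx 1.2425\) nats, sits midway between the entropy of a uniform choice over three options and one over four. The sketch below shows only the selection of "high-entropy decision points" under that threshold. Renormalizing the top-10 probabilities before computing entropy is an assumption, the function names are hypothetical, and the signal-to-noise computation itself is not reproduced because the note does not define it.

```python
import math

# Threshold from the note: midpoint of ln 3 and ln 4, in nats.
THRESHOLD = (math.log(3) + math.log(4)) / 2


def top_k_entropy(probs, k=10):
    """Entropy (nats) of the k largest probabilities, renormalized to sum to 1.

    Assumes probs is a non-empty list of non-negative values with a
    positive sum. The renormalization step is an assumption.
    """
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    return -sum((p / total) * math.log(p / total) for p in top if p > 0)


def high_entropy_positions(step_probs, k=10, threshold=THRESHOLD):
    """Indices of token positions whose top-k entropy exceeds the threshold."""
    return [i for i, probs in enumerate(step_probs)
            if top_k_entropy(probs, k) > threshold]
```

For example, a position that is uniform over three tokens falls below the threshold, while one that is uniform over four tokens exceeds it.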
🖼️ Image Restoration¶
- Hierarchical Image Tokenization for Multi-Scale Image Super Resolution
-
H-VAR re-slices the "residual quantization for multi-scale generation" VAR paradigm into hierarchical image tokenization (HIT), enabling a 310M small model to output three meaningful intermediate resolutions (128 / 256 / 512) in a single forward pass. A DPO regularization term, which does not require an external reward model, is added to bias outputs toward HR. On standard ISR datasets, it competes with the 1B-parameter VARSR.
- Image Restoration via Diffusion Models with Dynamic Resolution
-
SubDAPS / SubDAPS++ adapts pixel-space diffusion restoration methods like DPS and DAPS into a "dynamic resolution diffusion model" framework—sampling in \(64^2 / 128^2\) subspaces in early stages and returning to \(256^2\) full resolution later. It replaces Langevin with conjugate gradient, switches between stochastic/deterministic sampling via thresholding, and adds a corrector step that requires no extra network evaluation. On four linear and two nonlinear restoration tasks, it outperforms both pixel and latent diffusion methods on most metrics while being faster at inference.
🔍 Information Retrieval & RAG¶
- Hierarchical Abstract Tree for Cross-Document Retrieval-Augmented Generation
-
Ψ-RAG replaces RAPTOR's k-means with a "merge–collapse" hierarchical clustering to construct a cross-document abstract tree, paired with a retrieval-answering Agent capable of multi-turn rewriting and a hybrid sparse BM25 index. This enables Tree-RAG, for the first time, to match or even surpass Graph-RAG on corpus-level, cross-document multi-hop QA, achieving an average F1 25.9% higher than RAPTOR and 7.4% higher than HippoRAG 2.
- Very Efficient Listwise Multimodal Reranking for Long Documents
-
ZipRerank simultaneously eliminates the two main bottlenecks of VLM-based listwise reranking—"overly long visual token sequences" and "autoregressive decoding with per-token ranking output"—by employing query-aware token pruning and single-logit sorting. On MMDocIR, it reduces LLM inference latency by an order of magnitude while matching or surpassing the current SOTA MM-R5.
✏️ Knowledge Editing¶
- CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing
-
CrispEdit formulates LLM editing as a "minimize edit loss s.t. capability loss unchanged" constrained optimization, converts it via Bregman divergence equivalence to low-curvature subspace projection using the Gauss-Newton Hessian, and leverages K-FAC plus a Kronecker basis trick that avoids explicit projector construction. This enables 3000 edits on an A40 GPU in 6 minutes, while keeping LLaMA-3-8B's average drop on MMLU/IFEval/ARC-C/TruthfulQA/GSM8K under 1%, significantly outperforming AlphaEdit / MEMIT / fine-tuning.
- KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Controls
-
KORE injects new knowledge into LMMs via a two-stage "knowledge-oriented control": it automatically expands a single fact into structured multi-turn dialogues and instruction tasks (improving generalization), while initializing LoRA adapters in the null space of the covariance matrix of prior knowledge (minimizing interference with old capabilities). This achieves both strong adaptation and retention on LLaVA-v1.5 / Qwen2.5-VL.
🌐 Multilingual & Translation¶
- ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World
-
ML-Embed extends the Matryoshka concept from one dimension (representation dimension) to three dimensions—embedding parameters (MEL), model depth (MLL), and representation dimension (MRL)—enabling full-stack nested training. It constructs a multilingual training set with 282 natural languages and 40 programming languages, totaling 50 million samples, and releases a family of open-source models from 140M to 8B parameters. On 17 MTEB benchmarks, it ranks first in 9, with notable gains in Polish (+22.89) and Vietnamese (+6.88).
- Optimizing Language Models for Crosslingual Knowledge Consistency
-
This paper addresses the issue where multilingual LLMs provide conflicting answers to the same question in different languages. It proposes an RL objective that uses the "log-likelihood of the answer in another language" as the reward, proves that the optimal policy takes a product-of-experts form and guarantees crosslingual preference consistency when \(\gamma_1\gamma_2=\beta^2\). Based on this, it derives the Direct Consistency Optimization (DCO) algorithm, which requires neither a reward model nor online sampling. DCO improves both crosslingual consistency (RankC) and answer accuracy across 9 LLMs, 3 multilingual QA benchmarks, and 26 languages.
🚗 Autonomous Driving¶
- DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
-
DeepSight shifts "future world prediction" from explicit pixel reconstruction (codebook single-frame) to multi-frame parallel implicit prediction of DINOv3 semantic features in BEV space, with an additional on-demand Adaptive Chain-of-Thought. This enables Qwen2.5-VL-3B to achieve a Driving Score of 86.23 (+7.39) and Success Rate of 71.36% (+13.63) on Bench2Drive closed-loop, with only ~4% extra inference latency.
🧑 Human Understanding¶
- MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery
-
MotionGRPO reformulates egocentric full-body motion recovery from head-mounted devices as an MDP over diffusion sampling, employing GRPO with a "trajectory-conditioned perceptual model + 4 joint-level sub-rewards" hybrid reward for post-training. It identifies the critical bottleneck of "overly strong input conditions leading to nearly identical intra-group samples and vanishing advantage variance," and restores intra-group diversity by injecting Perlin noise into the conditions. On AMASS/RICH, it reduces MPJPE from EgoAllo's 124.985 mm to 114.207 mm.
🎯 Object Detection¶
- Smoothing Slot Attention Iterations and Recurrences
-
Addressing two long-standing but overlooked issues in Slot Attention for image and video object-centric learning—namely, "insufficient information in cold-start queries" and "forced unification of aggregation transformations for first/non-first frames"—the authors propose SmoothSA: a self-distillation-based lightweight warm-up module that injects sample-specific information into queries, and a scheduling scheme where the first frame undergoes three full iterations while non-first frames only run one. This approach achieves new SOTA on both image and video OCL benchmarks.
⚛️ Physics¶
- Neural QAOA\(^2\): Differentiable Joint Graph Partitioning and Parameter Initialization for Quantum Combinatorial Optimization
-
A generative-evaluative neural network (GEN) handles the two tasks of "graph partitioning + quantum circuit parameter initialization" in QAOA² jointly and differentiably: the evaluator learns a high-fidelity quantum performance surrogate, and the generator, guided by its gradients, outputs discrete partitions and parameter initializations. With a straight-through estimator (STE) and an orthogonal complement head (OCH), the system is end-to-end trainable. It outperforms heuristic baselines on 183 QUBO/Ising/MaxCut instances (21 to 1000 variables), ranking first on 101 of them.
🎁 Recommender Systems¶
- Can Recommender Systems Teach Themselves? A Recursive Self-Improving Framework with Fidelity Control
-
RSIR enables sequential recommendation models to generate new synthetic user interaction sequences using their own predictive capabilities, retrain a new model, and filter out samples deviating from the user preference manifold via a ranking-based "fidelity check," thus preventing self-consuming model collapse. On 4 datasets × 3 mainstream backbones, it consistently improves NDCG/Recall by 4–11%, and theoretically proves this process is equivalent to implicit regularization along the tangent space of the user preference manifold.
📂 Others¶
- Active Tabular Augmentation via Policy-Guided Diffusion Inpainting
-
This paper formalizes the "fidelity-utility gap" in tabular augmentation (where generators optimize for distribution matching, but augmentation value comes from low-density regions), and proposes the TAP algorithm, which uses diffusion inpainting for manifold-constrained proposals, policy-guided utility-aligned selection, and conservative windowed submission with hard constraint gating. On 7 real tabular datasets, TAP improves classification accuracy by up to 15.6% and reduces regression RMSE by 32% compared to baselines.
- Adaptive Multi-Round Allocation with Stochastic Arrivals
-
This work formalizes network recruitment as a budget-constrained sequential control problem, proves that the single-round optimal allocation is greedy; reduces multi-round planning to \(O(b^5\log b)\) complexity via a population-level surrogate value function, and provides robustness guarantees under model misspecification by decomposing errors into frontier, population, and approximation types.
- AI Cap-and-Trade: Efficiency Incentives for Accessibility and Sustainability
-
Drawing inspiration from carbon cap-and-trade, the authors propose a quota-trading market for AI inference FLOPs (AI Allowance). Using KKT conditions, they prove that under reasonable parameters, this mechanism strictly reduces FLOP usage across companies, thereby simultaneously mitigating both the energy consumption of large models and the market exclusion of smaller companies.
- Cascaded Flow Matching for Heterogeneous Tabular Data with Mixed-Type Features
-
TabCascade decomposes each table row into "low-resolution (categorical + discretized numerical)" and "high-resolution (continuous numerical)" cascaded stages: first, CDTD learns the low-res joint distribution; then, flow matching generates numerical details conditioned on the low-res output, with data-dependent coupling and a learnable nonlinear time schedule to tighten transport cost. It natively supports generation of mixed-type features such as missing values and zero-inflation, achieving a 51.9% improvement in detection score over SOTA on 12 datasets.
- Complexity as Advantage: A Regret-Based Perspective on Emergent Structure
-
This paper proposes Complexity-as-Advantage (CAA): redefining "complexity" as the regret dispersion among a family of resource-bounded observers on the same process. It is shown that, under the log-loss + Markov framework, this is equivalent to the sum of conditional mutual information atoms (recovering excess entropy), and from a coding perspective, to the variance of excess description length (MDL). Thus, Kolmogorov complexity, Bennett logical depth, and excess entropy are unified into a computable, empirically estimable scalar spectrum.
- Decision Tree Learning on Product Spaces
-
This work extends the theoretical guarantees of "top-down greedy decision tree heuristics" from Blanc et al. (ITCS'20) from the uniform distribution to arbitrary product distributions, providing an upper bound of \(\exp(\Delta_\mathrm{opt} D_\mathrm{opt}\log(e/\epsilon))\) (strictly tighter than ITCS'20 in the full binary tree case), and is completely parameter-free—it can be run without prior knowledge of the optimal tree size or depth.
- Estimating Correlation Clustering Cost in Node-Arrival Stream
-
This paper studies the problem of approximating the cost of correlation clustering in the "node-arrival" data stream model. The authors propose the C4Approx algorithm, which achieves a \((O(1), n^{1-\alpha})\)-approximation using \(O(n^{(3+\alpha)/4}\log n)\) words of sublinear space and a constant number of passes. Two matching lower bounds are provided, showing that both multi-pass and additive error are unavoidable. On real data, storing only 2% of nodes suffices to match the performance of Pivot.
- From Generalist to Specialist Representation
-
This paper provides the first fully nonparametric (no intervention, no functional constraints) two-layer hierarchical identifiability proof: the temporal-task structure is identifiable via CI tests from the collider perspective, and task-relevant latents can be separated from generalist representations using sparsity regularization.
- From Human-Level AI Tales to AI Leveling Human Scales
-
This paper uses LLMs as population extrapolators, calibrating 18 ability dimensions on a "world population accuracy" logarithmic scale \(L=-\log_B p_W\). It finds that the true base for Volume / Attention dimensions is \(B \gg 10\), while for Comprehension \(B \approx 1\), revealing that current AI-human comparisons are fundamentally misaligned.
- GEM-FI: Gated Evidential Mixtures with Fisher Modulation
-
This paper addresses two key issues in evidential deep learning (EDL): overconfidence on out-of-distribution (OOD) samples and the inability of single-head models to capture multimodal epistemic uncertainty. It proposes a three-part solution—GEM-Core/MIX/FI: using learned feature energy to gate evidence, employing a mixture of evidential heads to approximate ensemble behavior in a single inference pass, and introducing Fisher information regularization to stabilize mixture weights. On OOD detection tasks such as CIFAR-10→SVHN/CIFAR-100, the method outperforms DAEDL while maintaining single-pass inference.
- DynaDiff: Generative Adaptation of Dynamics to Environmental Shifts via Weight-space Diffusion
-
DynaDiff reframes the meta-learning problem of "training a predictor for a new environment" as a conditional sampling task of "directly generating the full network weights using a diffusion model." Leveraging weight graphs, function-consistency loss, and a dynamics-aware prompter, it achieves an average RMSE reduction of 10.78% over strong baselines across four PDE systems.
- HEDP: A Hybrid Energy-Distance Prompt-based Framework for Domain Incremental Learning
-
Inspired by the physical intuition of Helmholtz free energy, each domain's prompt parameters are trained to form an "energy curve compressed to boundary \(\Theta\) and aligned to midline \(\Delta\)." During inference, an energy factor and a distance factor are jointly used to weight each domain prompt. This approach improves performance on unknown domains by 1.76 / 3.12 / 2.57 percentage points on the CDDB / DomainNet / CORe50 DIL benchmarks, respectively.
- Local and Mixing-Based Algorithms for Gaussian Graphical Model Selection from Glauber Dynamics
-
This work is the first to study the problem of learning Gaussian graphical model structure from a single trajectory of Gaussian Glauber dynamics. Two complementary algorithms are proposed: LET-GL (local edge testing based on i,i,j,i windows, perfectly parallelizable) and BTR-GL (under the Dobrushin condition, uses burn-in/thinning to "decorrelate" the trajectory into approximately i.i.d. samples, which are then fed to existing i.i.d. learners). The paper provides finite-sample recovery guarantees, information-theoretic lower bounds, and an independently valuable total variation mixing bound for the random-scan Gaussian Gibbs sampler.
- Local Hessian Spectral Filtering for Robust Intrinsic Dimension Estimation
-
This paper proposes LHSD, which applies a Hill-type spectral filter to the log-density Hessian of a score model, retaining only near-zero eigenvalues to count the dimension of the tangent space. Stochastic Lanczos Quadrature reduces the computational cost from \(\mathcal{O}(D^3)\) to \(\mathcal{O}(D)\), enabling stable estimation of local intrinsic dimension in 3072-dimensional image spaces, and is used to diagnose memorization of training samples in diffusion models.
- Matroid Algorithms Under Size-Sensitive Independence Oracles
-
The authors propose a size-sensitive matroid oracle model where the query cost grows linearly with the size of the queried set, and prove that under this model, the optimal query costs for finding a basis, estimating rank, and estimating the partition number are all \(\tilde{\Theta}(n^2)\). For matroids with bounded circuit size \(c\), they provide an \(\mathcal{O}(n^{2-1/c}\log n)\) algorithm for maximum weight basis, breaking the quadratic lower bound.
- Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment
-
For TabPFN-type "tabular foundation models" that feed the training set directly as in-context input to attention, this work finds that such models severely overfit the majority class in the training set and proposes a test-time posterior correction. The authors introduce DistPFN: a one-line posterior reweighting \(\tilde{p}(y) \propto \hat{p}(y)^2 / p_{train}(y)\), which lifts TabPFN-v2 accuracy under strong label shift (\(\beta=5\)) from 72.7% to 76.9% on 253 OpenML datasets, without retraining, estimating test priors, or modifying the architecture.
- Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection
-
MPFM replaces the traditional "unimodal Gaussian prototype" in OSAD with a learnable Gaussian mixture prototype space, directly regresses a GMM-form velocity field via flow matching, and adds a mutual information maximization regularizer to prevent prototype collapse. On 9 industrial/medical AD datasets, under the 10/1 anomaly sample setting, it outperforms prior state-of-the-art methods including DRA, AHL, and DPDL.
- Networked Information Aggregation for Binary Classification
-
This work extends the Kearns-Roth-Ryu 2026 result ("sequentially passing prediction columns among linear regression agents on a DAG nearly achieves global optimum") to binary classification: each agent observes only a subset of feature columns and sequentially forwards its logit to downstream agents. Under the \(M\)-coverage condition, this achieves the global logistic regression optimum up to \(O(M/\sqrt{D})\) excess BCE loss; a matching hard instance proves an \(\Omega(k/D)\) lower bound, characterizing network depth as the fundamental bottleneck for information aggregation.
- New Bounds for Kernel Sums via Fast Spherical Embeddings
-
This work accelerates the Bartal-Recht-Schulman 2011 "randomized Nash device" spherical embedding theorem with iterative Fastfood transforms (time \(\widetilde{O}(d + \Lambda^2 + \varepsilon^{-2})\)) and uses it as a preprocessing step for Gaussian KDE, compressing the diameter to \(\widetilde{O}(1/\sqrt{\varepsilon})\). The result is a new Gaussian KDE query time bound \(\widetilde{O}(d + \varepsilon \Delta_\sigma^2 + 1/\varepsilon^3)\), which outperforms RFF / FJLT+RFF / Fastfood in the regime of small \(\varepsilon\) and moderate diameter.
- NonZero: Interaction-Guided Exploration for Multi-Agent Monte Carlo Tree Search
-
An asinh-linked GLM surrogate compresses the multi-agent MCTS joint-action space \(d^n\) into a low-dimensional nonlinear bandit. Using "first-order difference + second-order mixed difference" as the NonUCT proposal rule, only a small candidate set \(\mathcal{C}(s)\) is maintained at each node. It is proven to achieve \(\widetilde{O}(T^{3/4})\) local regret (independent of \(d^n\)). On MatGame/SMAC/SMACv2, both sample efficiency and final performance surpass strong baselines such as MAZero.
- Polaris: Coupled Orbital Polar Embeddings for Hierarchical Concept Learning
-
Polaris decomposes concept representations into two decoupled signals—"direction (semantics)" and "orbital potential (hierarchy)"—and learns both on the unit hypersphere: tangent space projection plus exponential mapping ensures manifold closure, anisotropic spherical SVGD prevents equatorial concentration, and vMF KL divergence implements the asymmetric "parent should have higher entropy than child" constraint. On taxonomy expansion tasks, Polaris improves top-K recall by up to 19 points and reduces mean rank by 60%.
- Possibilistic Predictive Uncertainty for Deep Learning
-
This paper replaces the Bayesian probability framework with possibility theory and proposes DAPPr, which projects the possibilistic posterior in parameter space onto the prediction space via a supremum and fits it with a learnable Dirichlet possibility function. The resulting epistemic uncertainty modeling approach requires only 10 lines of code, can directly replace cross-entropy, and outperforms the EDL family in OOD detection.
- Provably Data-driven Multiple Hyper-parameter Tuning with Structured Loss Function
-
This work employs "real algebraic geometry + first-order logic quantifier elimination" to provide the first provable generalization bound for multi-dimensional hyperparameter tuning, extending the Balcan 2025 framework—which was limited to one-dimensional scalar hyperparameters—to arbitrary \(p\) dimensions, bilevel validation losses, approximate inner optimization, and other practical scenarios. It also provides the first matching lower bound.
- Realizable Bayes-Consistency for General Metric Losses
-
This paper gives a sharp characterization, in the realizable case, of an open problem: when does a hypothesis class \(\mathcal{H}\) admit a distribution-free, strongly universally Bayes-consistent learning algorithm under general (possibly unbounded) metric losses? The necessary and sufficient condition is that \(\mathcal{H}\) does not contain a new type of combinatorial obstruction, the "unbounded gap Littlestone tree".
- Position: Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective
-
This ICML position paper argues that all current AI reliability methods (RAG / Self-Consistency / RLHF / Agent Memory) can only verify explicit knowledge, while the true power of AI comes from the 80-95% of "implicit knowledge" in training data that has never been formally recorded by humans. The author proposes Knowledge Objects (KOs) as infrastructure—externalizing AI's implicit reasoning into structured artifacts that humans can inspect, verify, and endorse, enabling the cost of a single human verification to compound across the community over time.
- Scaling Continual Learning to 300+ Tasks with Bi-Level Routing Mixture-of-Experts
-
The authors propose CaRE, which inserts a bi-level routing MoE (BR-MoE) into each ViT block: first, a "class recognizer" selects the Top-M relevant task routers based on entropy; then each router activates its Top-K task experts and adds a shared EMA expert. This enables retention of old knowledge and continual absorption of new classes even with 300+ tasks, filling the gap in "long-sequence CIL". The authors also release the 1000-class OmniBenchmark-1K benchmark.
- Singular Bayesian Neural Networks
-
This work directly parameterizes the weight matrix as \(W=AB^\top\) instead of applying a mean-field distribution to \(W\) itself, thereby inducing a low-rank posterior that is singular with respect to the Lebesgue measure. The number of parameters is reduced from \(O(mn)\) to \(O(r(m+n))\), and the PAC-Bayes complexity is tightened from \(\sqrt{mn}\) to \(\sqrt{r(m+n)}\). On MLP/LSTM/Transformer architectures, the method achieves OOD detection performance surpassing 5-member Deep Ensembles with \(33\times\) fewer parameters.
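The parameter saving from the factorization \(W=AB^\top\) is simple arithmetic: a dense \(m \times n\) matrix has \(mn\) entries, while \(A \in \mathbb{R}^{m \times r}\) and \(B \in \mathbb{R}^{n \times r}\) together have \(r(m+n)\). The sketch below only counts matrix entries; the layer sizes and rank are arbitrary illustrative values, not the paper's configuration.

```python
def dense_param_count(m, n):
    """Entries in a full m x n weight matrix W."""
    return m * n


def low_rank_param_count(m, n, r):
    """Entries in the factors A (m x r) and B (n x r) of W = A @ B.T."""
    return r * (m + n)


# Hypothetical layer: 512 x 256 weights factorized at rank 8.
m, n, r = 512, 256, 8
dense = dense_param_count(m, n)
low_rank = low_rank_param_count(m, n, r)
print(dense, low_rank)  # 131072 6144
```

The saving grows as the rank \(r\) shrinks relative to \(\min(m, n)\), which is also what drives the tighter \(\sqrt{r(m+n)}\) complexity term mentioned above.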