
🏥 Medical Imaging

🔬 ICLR2026 · 72 paper notes

Adaptive Domain Shift in Diffusion Models for Cross-Modality Image Translation

This paper proposes CDTSDE, a framework that embeds a learnable spatial-adaptive domain mixing field \(\Lambda_t\) into the reverse SDE of diffusion models, enabling cross-modality translation paths to traverse low-energy manifolds. The approach achieves higher fidelity with fewer denoising steps on MRI modality conversion, SAR→Optical, and industrial defect semantic mapping tasks.

Adaptive Test-Time Training for Predicting Need for Invasive Mechanical Ventilation in Multi-Center Cohorts

This paper proposes AdaTTT, a framework that achieves robust test-time adaptation on multi-center ICU EHR data for 24-hour-ahead invasive mechanical ventilation (IMV) prediction, via dynamic feature-aware self-supervised learning (adaptive masking strategy) and prototype-guided partial optimal transport alignment.

AFD-INSTRUCTION: A Comprehensive Antibody Instruction Dataset with Functional Annotations for LLM-Based Understanding and Design

This work constructs AFD-Instruction, the first large-scale antibody functional annotation instruction dataset (430K+ entries), aligning antibody sequences with natural-language functional descriptions via a multi-agent literature extraction pipeline. The dataset is used to instruction-tune general-purpose LLMs for antibody understanding and function-guided design, achieving an average accuracy improvement of 20+ points across five classification tasks.

An Orthogonal Learner for Individualized Outcomes in Markov Decision Processes

This paper systematically introduces semiparametric efficiency theory from causal inference into Q-function estimation for MDPs. It demonstrates that classical Q-regression and FQE are essentially naive plug-in learners subject to plug-in bias, and proposes the DRQQ-learner—a meta-learner that simultaneously achieves double robustness, Neyman orthogonality, and near-oracle efficiency. By deriving the efficient influence function (EIF) to construct a debiased two-stage loss, the method outperforms baselines across the board in the Taxi and Frozen Lake environments.

AntigenLM: Structure-Aware DNA Language Modeling for Influenza

AntigenLM is a GPT-2-style DNA language model that preserves the integrity of genomic functional units. Pretrained on complete influenza virus whole genomes and subsequently fine-tuned, it autoregressively predicts antigenic sequences of future circulating strains, achieving significantly lower amino acid mismatch rates than the evolutionary model beth-1 and general-purpose genomic models.

ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue

This paper proposes ATPO (Adaptive Tree Policy Optimization), which models multi-turn medical dialogue as a hierarchical Markov decision process (H-MDP). ATPO dynamically allocates rollout budgets via an uncertainty-aware adaptive tree expansion mechanism, using a composite uncertainty measure combining Bellman error and action-value variance to guide exploration. With Qwen3-8B, ATPO surpasses GPT-4o on three medical dialogue benchmarks.

Augmenting Representations with Scientific Papers

This paper proposes the first multimodal foundation model framework that aligns X-ray spectra with scientific literature via contrastive learning, achieving 20% Recall@1% cross-modal retrieval in a shared latent space, improving physical parameter estimation by 16–18%, and discovering rare astrophysical objects including candidate pulsating ultraluminous X-ray sources.

Benchmarking ECG FMs: A Reality Check Across Clinical Tasks

A comprehensive "reality check" benchmark evaluating 8 ECG foundation models across 12 datasets and 26 clinical tasks reveals that the compact structured state space model (SSM) ECG-CPC outperforms large-scale Transformers in 5 out of 7 task categories, demonstrating that architectural design matters more than model scale.

BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases

This paper introduces BiomedSQL, the first benchmark specifically designed to evaluate the scientific reasoning capabilities of Text-to-SQL systems on biomedical knowledge bases. It comprises 68,000 question/SQL/answer triples and reveals a substantial gap between the best-performing model (o3-mini, 62.6%) and domain experts (90%).

Boosting Medical Visual Understanding From Multi-Granular Language Learning

This paper proposes Multi-Granular Language Learning (MGLL), a plug-and-play contrastive learning framework that jointly optimizes a soft CLIP loss, a point-wise loss, and a smooth KL divergence to align medical images with multi-label, multi-granular text descriptions. MGLL consistently surpasses state-of-the-art methods on fundus and X-ray datasets, and when used as a visual encoder for multimodal large language models, improves diagnostic accuracy by up to 34.1%.
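The soft-CLIP idea can be sketched with a toy loss in which the contrastive targets are label-overlap distributions rather than a one-hot identity. This is a minimal illustration only: the function names and the overlap-based target construction are my assumptions, and MGLL's full objective additionally includes the point-wise and smooth-KL terms.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def soft_clip_loss(img_emb, txt_emb, labels, tau=0.07):
    """Soft contrastive loss: the target distribution for each image is
    its normalized multi-label overlap with every text in the batch, so
    partially matching image-text pairs receive partial credit instead
    of being treated as pure negatives."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = (img @ txt.T) / tau                     # (B, B) cosine similarities
    overlap = (labels @ labels.T).astype(float)      # shared-label counts
    targets = overlap / np.maximum(overlap.sum(axis=1, keepdims=True), 1e-8)
    return float(-(targets * log_softmax(logits)).sum(axis=1).mean())
```

With identity labels this reduces to the ordinary CLIP cross-entropy; richer multi-label annotations smooth the targets.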

Bridging Explainability and Embeddings: BEE Aware of Spuriousness

This paper proposes the BEE framework, which identifies and names spurious correlations (SCs) directly from learned classifier weights by analyzing how fine-tuning perturbs the weight-space geometry of pre-trained representations. The method requires no counterfactual samples and can discover hidden dataset biases. On ImageNet-1k, BEE uncovers spurious associations that reduce accuracy by up to 95%.

Can SAEs Reveal and Mitigate Racial Biases of LLMs in Healthcare?

This paper investigates whether Sparse Autoencoders (SAEs) can reveal and mitigate racial biases in LLMs within clinical settings. SAEs successfully identify harmful race-associated features (e.g., co-activation of "Black" with violence-related terms), but their effectiveness at bias mitigation in complex clinical tasks is limited (FLDD < 3%), falling far short of simple prompting strategies (FLDD 8–15%).

CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework

This paper proposes CARE, a framework that decomposes medical VQA into a three-stage expert pipeline—entity proposal → referring segmentation → evidence-grounded QA—with RLVR fine-tuning applied to each VLM and GPT-5 serving as a dynamic coordinator for tool planning and CoT review. CARE achieves an average accuracy of 77.54% across four medical VQA benchmarks using only 10B parameters, surpassing the 32B end-to-end state-of-the-art (72.29%).

Causal Interpretation of Neural Network Computations with Contribution Decomposition

This paper proposes CODEC (Contribution Decomposition), which applies Integrated Gradients to compute the contribution of hidden-layer neurons to the output (rather than analyzing activations alone), and then decomposes these contributions into sparse modes via a Sparse Autoencoder. This approach achieves stronger causal interpretability and network control than activation-based analysis, and is successfully applied to ResNet-50 and a retinal biological neural network model.

Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space

This paper proposes modeling the human concept production process as cumulative trajectories in Transformer embedding space, defining 5 kinematic metrics (distance, velocity, acceleration, entropy, and centroid distance). Evaluated on 4 datasets spanning 3 languages and covering neurodegenerative disease, taboo word fluency, and attribute listing tasks, the framework successfully distinguishes clinical groups and concept categories, with highly consistent results across different embedding models.

COMPASS: Robust Feature Conformal Prediction for Medical Segmentation Metrics

COMPASS constructs conformal prediction intervals by applying linear perturbations along the low-dimensional subspace most sensitive to the target metric within the intermediate feature space of a segmentation network. It achieves significantly narrower prediction intervals than conventional CP methods across four medical segmentation tasks while maintaining valid coverage.
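For background, the split-conformal baseline that COMPASS sharpens can be sketched in a few lines for a scalar quality metric (e.g. predicted Dice per case). The feature-space perturbation machinery of the paper is not reproduced; `conformal_interval` and the symmetric-residual score are illustrative assumptions.

```python
import numpy as np

def conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Split conformal prediction: calibrate absolute residuals on a
    held-out set, then return symmetric intervals around each test
    prediction with finite-sample marginal coverage >= 1 - alpha."""
    scores = np.abs(np.asarray(cal_pred) - np.asarray(cal_true))
    n = len(scores)
    # (1 - alpha) empirical quantile with the (n + 1) conformal correction
    k = min(int(np.ceil((n + 1) * (1 - alpha))) - 1, n - 1)
    qhat = np.sort(scores)[k]
    return test_pred - qhat, test_pred + qhat
```

The interval width here is one global constant; COMPASS's contribution is to make it adapt per case while keeping the coverage guarantee.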

ConfHit: Conformal Generative Design with Oracle Free Guarantees

This paper proposes ConfHit, a framework that employs density-ratio-weighted conformal permutation p-values to perform certification (determining whether a generated batch contains a hit) and design (pruning the candidate set while preserving statistical guarantees). Without requiring an experimental oracle and under distributional shift, ConfHit provides finite-sample \(1-\alpha\) coverage guarantees for generative molecular design.
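The core statistical object is a density-ratio-weighted conformal p-value; a minimal sketch under assumed inputs (precomputed nonconformity scores and density-ratio weights, with larger scores meaning more "hit-like") looks as follows. The function name is illustrative, not the paper's API.

```python
import numpy as np

def weighted_conformal_pvalue(cal_scores, cal_weights, test_score, test_weight):
    """Weighted conformal p-value under covariate shift: calibration
    points are reweighted by density ratios w(x) = dP_test/dP_cal(x),
    which restores exchangeability. With all weights equal this reduces
    to the standard conformal p-value; a small p-value flags the test
    point as atypically high-scoring relative to calibration."""
    cal_scores = np.asarray(cal_scores, float)
    w = np.asarray(cal_weights, float)
    num = w[cal_scores >= test_score].sum() + test_weight
    return float(num / (w.sum() + test_weight))
```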

Controllable Sequence Editing for Biological and Clinical Trajectories

This paper proposes Clef, a controllable sequence editing model based on temporal concepts that performs immediate and delayed editing of biological/clinical multivariate trajectories under given conditions (e.g., drugs, surgery). On cell reprogramming and patient laboratory test data, Clef achieves 16.28% MAE improvement for immediate editing, 26.73% for delayed editing, and up to 62.84% improvement for zero-shot counterfactual generation.

Controlling Repetition in Protein Language Models

This work presents the first systematic study of pathological repetition in protein language models (PLMs), introducing a unified repetition metric \(R(x)\) and a utility metric \(U(x)\), and proposes UCCS (Utility-Controlled Contrastive Steering), a method that injects steering vectors decoupled from repetition into hidden layers at inference time to suppress repetition while preserving folding credibility without retraining the model.
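The paper's exact \(R(x)\) is not reproduced here, but the kind of quantity it measures can be illustrated with a simple k-mer-based repetition score on a sequence string (an assumption of mine, not the paper's definition):

```python
def repetition_score(seq, k=3):
    """Fraction of overlapping k-mers that repeat an earlier k-mer in
    the sequence: 0.0 means all k-mers are unique, and the score
    approaches 1.0 for highly repetitive sequences."""
    if len(seq) < k:
        return 0.0
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return 1.0 - len(set(kmers)) / len(kmers)
```

A steering method like UCCS would aim to drive such a score down at inference time without degrading a separate utility measure.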

CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of LLMs in Mental Health QA

CounselBench is a two-component benchmark constructed with 100 licensed mental health professionals: CounselBench-EVAL (2,000 expert annotations across six clinical dimensions) and CounselBench-Adv (120 adversarial questions with 1,080 annotated responses). It systematically reveals that LLMs achieve superficially high scores in open-ended mental health QA while exhibiting safety risks such as over-generalization and unsolicited medical advice, and demonstrates that LLM-as-Judge is severely unreliable in safety-critical domains.

CryoNet.Refine: A One-step Diffusion Model for Rapid Refinement of Structural Models with Cryo-EM Density Map Restraints

CryoNet.Refine is proposed as the first AI-based framework for cryo-EM atomic model refinement. It integrates a one-step diffusion model initialized from Boltz-2 weights, a novel differentiable density generator that physically simulates synthetic density maps, and the first use of density-map correlation (cosine similarity) as a differentiable loss function, jointly optimized with geometric restraint losses including Ramachandran, rotamer, and bond-angle terms; a test-time optimization strategy enables per-case customization. The method outperforms Phenix.real_space_refine across the board on 120 protein and DNA/RNA complex benchmarks (CC_mask: 0.59 vs. 0.54; Ramachandran favored: 98.92%).

Decentralized Attention Fails Centralized Signals: Rethinking Transformers for Medical Time Series

This paper proposes the TeCh framework, whose core contribution is the CoTAR (Core Token Aggregation-Redistribution) module, which replaces standard attention in Transformers to model channel dependencies in medical time series. By introducing a global "core token" as a proxy — first aggregating information from all channels and then redistributing it back — the computational complexity is reduced from \(O(n^2)\) to \(O(n)\). On the APAVA dataset, TeCh achieves 86.86% accuracy (surpassing Medformer by 12.13%) while consuming only 33% of the memory and 20% of the inference time.
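The aggregate-then-redistribute pattern behind the O(n) complexity can be sketched as follows. This is a hypothetical simplification of CoTAR (single core token, single head, numpy instead of a trained module); the real module's parameterization differs.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def core_token_mix(x, w_core):
    """Aggregate-redistribute channel mixing: n channel tokens first
    attend into one core token (n interactions), then the core summary
    is broadcast back (n more interactions) — O(n) total, versus
    O(n^2) pairwise attention across channels."""
    attn = softmax(x @ w_core)         # (n,) aggregation weights
    core = attn @ x                    # (d,) core-token summary
    gate = softmax(x @ core)           # (n,) redistribution weights
    return x + np.outer(gate, core)    # residual broadcast of the summary
```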

Deep Hierarchical Learning with Nested Subspace Networks for Large Language Models

This paper proposes Nested Subspace Networks (NSN), which reparameterize linear layers via low-rank decomposition into a strictly nested subspace hierarchy. Combined with uncertainty-aware multi-rank training, a single model can instantaneously trade off computation against performance at test time (50% FLOPs reduction with only 5% accuracy loss), and can be applied post-hoc to pretrained LLMs.
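The "strictly nested subspace" property can be illustrated with an SVD-based construction: every rank-r prefix of the factors is itself a valid smaller layer. NSN learns such factors with uncertainty-aware multi-rank training; the SVD stand-in below only demonstrates the nesting and the test-time rank truncation.

```python
import numpy as np

def nested_factors(W):
    """Nested low-rank reparameterization of a weight matrix: any
    rank-r prefix of (U, V) is the best rank-r approximation of W,
    so sub-models are strictly nested rather than trained separately."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U * s, Vt                      # fold singular values into U

def truncated_forward(x, U, V, rank):
    """Run the layer at a reduced rank: fewer FLOPs, graceful error."""
    return (x @ V[:rank].T) @ U[:, :rank].T
```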

DISCO: Densely-overlapping Cell Instance Segmentation via Adjacency-aware Collaborative Coloring

This paper formulates densely-overlapping cell instance segmentation as a graph coloring problem and proposes Disco, a divide-and-conquer framework combining explicit conflict node marking with implicit adjacency-constrained disambiguation. By decomposing cell adjacency graphs via BFS and introducing five collaborative loss functions, Disco achieves a 7.08% PQ improvement on the high-density pathology dataset GBC-FS 2025 while attaining state-of-the-art performance across four heterogeneous datasets.

Discrete Diffusion Trajectory Alignment via Stepwise Decomposition

This paper proposes SDPO (Stepwise Decomposition Preference Optimization), which decomposes the trajectory alignment problem of discrete diffusion models into stepwise posterior alignment subproblems, avoiding the difficulty of backpropagating gradients through the entire denoising chain. SDPO achieves significant improvements over existing methods across three tasks: DNA sequence design, protein inverse folding, and language modeling.

DistMLIP: A Distributed Inference Platform for Machine Learning Interatomic Potentials

DistMLIP is a distributed inference platform based on a zero-redundancy graph-level parallelization strategy that addresses the lack of multi-GPU support in existing machine learning interatomic potentials (MLIPs). On 8 GPUs, it enables simulations approaching one million atoms, achieving up to 8× speedup over spatial partitioning methods while supporting systems 3.4× larger.

Distributional Consistency Loss: Beyond Pointwise Data Terms in Inverse Problems

This paper proposes the Distributional Consistency (DC) loss, which replaces conventional pointwise data fidelity terms (e.g., MSE/NLL) with distribution-level calibration, thereby eliminating overfitting to noise. The approach achieves significant performance gains in DIP-based denoising and PET image reconstruction without requiring early stopping.
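One plausible instantiation of a distribution-level data term (my assumption, not necessarily the paper's exact DC loss) is a Cramér–von Mises-style distance between the empirical CDF of the residuals and the assumed noise CDF: the fit is rewarded for leaving residuals that *look like* noise, rather than for driving each residual to zero.

```python
import numpy as np
from math import erf, sqrt

def dc_loss(residuals, sigma):
    """Distribution-level data term: mean squared gap between the
    empirical CDF of the residuals and the N(0, sigma^2) noise CDF.
    Unlike pointwise MSE, this cannot be driven to zero by fitting
    the noise realization itself."""
    r = np.sort(np.asarray(residuals, float))
    n = len(r)
    emp = (np.arange(1, n + 1) - 0.5) / n
    theo = np.array([0.5 * (1.0 + erf(x / (sigma * sqrt(2.0)))) for x in r])
    return float(((emp - theo) ** 2).mean())
```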

DM4CT: Benchmarking Diffusion Models for Computed Tomography Reconstruction

DM4CT is proposed as the first systematic benchmark for diffusion-based CT reconstruction, encompassing ten diffusion methods and seven baselines evaluated comprehensively across medical, industrial, and synchrotron radiation datasets, revealing both the strengths and limitations of diffusion models in CT reconstruction.

DriftLite: Lightweight Drift Control for Inference-Time Scaling of Diffusion Models

DriftLite exploits the inherent degrees of freedom between the drift and potential function in the Fokker-Planck equation to actively stabilize particle weights by solving a lightweight linear system for the optimal control drift at each step. This approach addresses weight degeneracy in Sequential Monte Carlo (SMC) at minimal computational cost, substantially outperforming Guidance-SMC baselines on Gaussian mixture, molecular system, and protein–ligand co-folding tasks.

Dual Distillation for Few-Shot Anomaly Detection

This paper proposes D24FAD, a dual distillation framework that combines teacher-student distillation on query images (TSD) and student self-distillation on support images (SSD), augmented by a learning-to-weight mechanism (L2W) for adaptive support importance estimation. The method achieves 100% AUROC on the APTOS fundus dataset with only 2-shot support.

EMR-AGENT: Automating Cohort and Feature Extraction from EMR Databases

This paper proposes EMR-AGENT, the first LLM agent-based framework for automated EMR preprocessing. By replacing hand-crafted rules with dynamic SQL interaction, it achieves cross-database cohort selection, feature extraction, and code mapping, demonstrating strong performance and generalization on MIMIC-III, eICU, and SICdb.

EvoFlows: Evolutionary Edit-Based Flow-Matching for Protein Engineering

EvoFlows proposes an edit-based flow matching approach that learns mutational trajectories between evolutionarily related protein sequences, enabling controllable numbers of edits (insertions, deletions, substitutions) on a template sequence while jointly predicting what to mutate and where to mutate.

Exo-Plore: Exploring Exoskeleton Control Space through Human-Aligned Simulation

This paper proposes the Exo-plore framework, which combines neuromechanical simulation with deep reinforcement learning to optimize hip exoskeleton control parameters without requiring human subject experiments, and generalizes to pathological gait scenarios.

ExpGuard: LLM Content Moderation in Specialized Domains

This paper proposes ExpGuard, a safety guardrail model targeting specialized domains such as finance, healthcare, and law, along with a companion dataset ExpGuardMix (58,928 samples). ExpGuard achieves prompt classification F1 exceeding WildGuard by 8.9% and response classification by 15.3% on domain-specific test sets, while maintaining state-of-the-art performance on general safety benchmarks.

Exploiting Low-Dimensional Manifold of Features for Few-Shot Whole Slide Image Classification

This paper identifies that pathology foundation model (PFM) features reside on a low-dimensional manifold (effective rank only 29.7 out of 512 dimensions), and that standard linear layers destroy this geometric structure, causing few-shot overfitting. The authors propose a plug-and-play MR Block — combining a frozen random matrix as a geometric anchor with a low-rank residual path for task adaptation — achieving state-of-the-art performance on few-shot WSI classification.

Fine-Tuning Diffusion Models via Intermediate Distribution Shaping

This work unifies rejection-sampling-based fine-tuning methods under the GRAFT framework, proving that they implicitly perform KL-regularized reward maximization. Building on this, P-GRAFT is proposed to perform distribution shaping at intermediate denoising steps (achieving a better bias–variance trade-off), and Inverse Noise Correction is introduced to improve flow model quality without reward signals, yielding an 8.81% VQAScore improvement on text-to-image generation.

From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents

This paper proposes EHR-ChatQA, the first benchmark to evaluate the end-to-end interactive workflow of database agents in electronic health record (EHR) settings, covering ambiguity clarification, terminology mismatch resolution, SQL generation, and answer return. Evaluation reveals that the strongest model (o4-mini) achieves Pass@5 above 90% but suffers a substantial drop in Pass^5 (the all-attempts-succeed rate), with a gap of up to 60%, exposing critical robustness deficiencies in safety-sensitive clinical domains.

Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology

This paper proposes the Stamp framework, which leverages spatial transcriptomics gene expression data as a supervisory signal. Through spatially-aware gene encoder pretraining and hierarchical multi-scale contrastive alignment, it enables joint representation learning of pathology images and spatial transcriptomics data, achieving state-of-the-art performance across 4 downstream tasks on 6 datasets.

Glance and Focus Reinforcement for Pan-cancer Screening

This paper proposes GF-Screen, a two-stage framework in which a lightweight Glance model employs reinforcement learning to rapidly localize CT sub-volumes containing lesions, while a Focus model performs fine-grained segmentation exclusively on the selected regions. By transferring GRPO's group-relative comparison paradigm from NLP to visual sub-volume groups, the method achieves RL optimization without a value network for the first time in a purely visual task. On the FLARE25 pan-cancer challenge, GF-Screen outperforms the champion solution by +25.6% DSC while achieving 5.7× faster inference.
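The group-relative comparison that GRPO transfers here is simple to state: each rollout's advantage is its reward standardized within its own group, which is what removes the need for a learned value network. A minimal sketch:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantage: each rollout in a group is
    scored against the group's own mean and standard deviation, so no
    learned value network (critic) is required."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

In GF-Screen the "group" would be a set of candidate sub-volumes scored by the downstream objective; that mapping is my reading of the summary, not a detail it states.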

HistoPrism: Unlocking Functional Pathway Analysis from Pan-Cancer Histology via Gene Expression Prediction

This paper proposes HistoPrism, an efficient Transformer architecture that injects cancer-type conditioning via cross-attention to predict pan-cancer gene expression from H&E histology images. It further introduces the Gene Pathway Coherence (GPC) evaluation framework based on Hallmark/GO pathways, achieving substantial improvements over STPath at the pathway level—particularly on low-variance, biologically fundamental pathways.

How to Make the Most of Your Masked Language Model for Protein Engineering

This work proposes a temperature-annealed stochastic beam search (SBS) sampling method for masked language models (MLMs), leveraging a wild-type marginal approximation of pseudo-log-likelihood (PLL) for efficient full-sequence evaluation. In vitro experiments on real therapeutic antibody optimization demonstrate that the choice of sampling algorithm is at least as important as model selection; SBS with guidance achieves a 100% success rate.

Human Behavior Atlas: Benchmarking Unified Psychological and Social Behavior Understanding

This work introduces Human Behavior Atlas—the first large-scale multimodal unified benchmark for behavior understanding spanning four dimensions (affective, cognitive, pathological, and social processes) with 101K+ samples—and trains three OmniSapiens-7B model variants to validate its effectiveness in multi-task training and transfer learning.

Improving 2D Diffusion Models for 3D Medical Imaging with Inter-Slice Consistent Stochasticity

This paper proposes Inter-Slice Consistent Stochasticity (ISCS), which generates inter-slice correlated noise via spherical linear interpolation (Slerp) during the re-noising step of diffusion sampling, removing the root cause of inter-slice discontinuity artifacts when 3D medical images are reconstructed with 2D diffusion priors. ISCS adds zero computation, hyperparameters, or training overhead, is plug-and-play with any 2D diffusion inverse-problem solver, and yields consistent improvements on sparse-view CT, limited-angle CT, and MRI super-resolution.
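The Slerp trick can be sketched directly: interpolating two Gaussian noise draws on the sphere keeps the result noise-like while making adjacent slices highly correlated. The helper names and the two-endpoint scheme are illustrative assumptions; the paper applies this inside a diffusion solver's re-noising step.

```python
import numpy as np

def slerp(a, b, t):
    """Spherical linear interpolation between two noise vectors; unlike
    linear interpolation, the result keeps an (approximately) valid
    Gaussian-noise norm instead of shrinking toward the mean."""
    a_n = a / np.linalg.norm(a)
    b_n = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(a_n @ b_n, -1.0, 1.0))
    if omega < 1e-6:
        return a
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

def correlated_slice_noise(n_slices, dim, seed=0):
    """Stack of inter-slice correlated noise vectors: slerp between two
    endpoint draws along the slice axis, so neighboring slices get
    nearly identical noise and distant slices stay nearly independent."""
    rng = np.random.default_rng(seed)
    e0, e1 = rng.standard_normal(dim), rng.standard_normal(dim)
    ts = np.linspace(0.0, 1.0, n_slices)
    return np.stack([slerp(e0, e1, t) for t in ts])
```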

Incentives in Federated Learning with Heterogeneous Agents

This paper analyzes incentive problems in heterogeneous federated learning from a game-theoretic perspective, proves the existence of pure-strategy Nash equilibria under heterogeneous data distributions and PAC accuracy objectives, and proposes a linear programming-based approximation algorithm to determine optimal contribution levels.

Inference-Time Dynamic Modality Selection for Incomplete Multimodal Classification

This paper proposes DyMo, an inference-time dynamic modality selection framework that derives a theoretically grounded MTIR reward function (based on a classification-loss-reduction proxy + class prototype distance + intra-class similarity calibration) to iteratively and selectively fuse reliable recovered modalities at inference time, offering the first systematic resolution of the discarding-imputation dilemma: discarding missing modalities loses task-relevant information, while imputation may introduce noise.

Intrinsic Lorentz Neural Network

This paper proposes ILNN, a fully intrinsic hyperbolic neural network in which all operations are performed entirely within the Lorentz model, eliminating the geometric inconsistencies introduced by Euclidean operations in existing methods. ILNN achieves state-of-the-art performance on image classification, genomics, and graph classification tasks.

Knowledgeable Language Models as Black-Box Optimizers for Personalized Medicine

This paper proposes LEON (LLM-based Entropy-guided Optimization with kNowledgeable priors), a mathematically rigorous framework that models personalized treatment design as a constrained conditional black-box optimization problem. Through entropy constraints and an adversarial source critic, LEON guides an LLM to serve as a zero-shot optimizer that proposes personalized treatment plans without any fine-tuning.

Learning Domain-Aware Task Prompt Representations for Multi-Domain All-in-One Image Restoration

This paper proposes DATPRL-IR, the first multi-domain all-in-one image restoration method, which learns domain-aware task prompt representations via a dual prompt pool (task prompt pool + domain prompt pool). Domain priors are distilled from an MLLM and injected into the backbone through adaptive gated fusion, achieving significant improvements over SOTA across 9 tasks spanning natural, medical, and remote sensing domains.

Learning Patient-Specific Disease Dynamics with Latent Flow Matching for Longitudinal Imaging Generation

This paper proposes the Δ-LFM framework, which employs an ArcRank loss to construct patient-specific, temporally aligned trajectories in latent space (directionally consistent and monotonically increasing in magnitude). The framework extends the flow matching time range from \([0,1]\) to \([0,T]\) (actual time intervals) to enable prediction at arbitrary time points. Δ-LFM comprehensively outperforms eight baseline methods across three Alzheimer's longitudinal MRI benchmarks and introduces a progression-specific evaluation metric, Δ-RMAE.
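The \([0,T]\) time rescaling is the easiest piece to make concrete. Under a plain linear conditional path (an assumption; Δ-LFM's actual path lives in an ArcRank-shaped latent space), extending the time range just rescales the interpolant and the velocity target:

```python
import numpy as np

def fm_pair(x0, x1, t, T):
    """Linear conditional flow-matching pair on real time t in [0, T]:
    x_t interpolates x0 -> x1 and the velocity regression target is the
    constant (x1 - x0) / T, so the model can be queried at arbitrary
    real-valued time points."""
    s = t / T
    x_t = (1 - s) * x0 + s * x1
    v_target = (x1 - x0) / T
    return x_t, v_target
```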

mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules

This paper proposes mCLM (Modular Chemical Language Model), which represents molecules as sequences of synthesizable building blocks, enabling LLMs to generate molecules that simultaneously satisfy pharmacological function and automated synthesis feasibility. mCLM achieves significant improvements in pharmacokinetic and toxicity properties across 430 FDA-approved drugs.

MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science

This work introduces MedAgentGym, the first unified agentic training environment for biomedical data science, comprising 72,413 task instances spanning 12 real-world scenarios and 129 categories, equipped with an executable sandbox and verifiable ground truth. A systematic benchmark evaluation of 29 LLMs reveals a substantial gap between commercial and open-source models. By combining efficient multi-threaded trajectory sampling with offline/online RL, the authors train Med-Copilot, achieving gains of +43.02%/+45.28% respectively and attaining performance competitive with GPT-4o.

MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning

This paper proposes MMedAgent-RL, a multi-agent system that simulates clinical consultation workflows (triage → specialist → attending physician) optimized via reinforcement learning. The core innovation is Curriculum-guided Multi-Agent Reinforcement Learning (C-MARL) with entropy-aware exploration, enabling the attending physician agent to adopt differentiated explore–exploit strategies when faced with correct, conflicting, or erroneous specialist opinions. The system achieves state-of-the-art performance on 5 medical VQA benchmarks spanning both in-domain and out-of-domain settings.

Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare

This paper introduces MENTAT—an evaluation dataset designed and annotated by 9 U.S. psychiatrists, comprising 203 base questions (each with 5 answer choices) expanded via demographic variable substitution, covering 5 clinical practice domains: diagnosis, treatment, triage, monitoring, and documentation. By systematically substituting patient age, race, and gender, the benchmark evaluates decision-making bias across 22 language models, revealing significant and unpredictable accuracy disparities along demographic dimensions.

NeuroCircuitry-Inspired Hierarchical Graph Causal Attention Networks for Explainable Depression Identification

This paper proposes the NH-GCAT framework, which explicitly incorporates neuroscience priors on depression-related neural circuits into a GNN, modeling brain activity at three spatial scales—region, circuit, and network. The method achieves state-of-the-art classification on the REST-meta-MDD dataset (AUC 78.5%, ACC 73.8%) and provides interpretable analyses consistent with established neuroscientific findings.

Omni-iEEG: A Large-Scale, Comprehensive iEEG Dataset and Benchmark for Epilepsy Research

This paper introduces the Omni-iEEG dataset (302 patients, 178 hours of high-resolution intracranial EEG recordings), defines standardized benchmark tasks and evaluation metrics grounded in clinical priors, and demonstrates that end-to-end modeling can match or surpass traditional biomarker-based approaches for epilepsy surgical planning.

Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

This paper theoretically identifies two fundamental flaws in existing length-penalty approaches—incorrectly penalizing high-entropy exploration tokens and erroneously rewarding redundant tokens—and proposes the DeCS framework. Through decoupled token-level rewards and curriculum batch scheduling, DeCS reduces reasoning tokens by over 50% across 7 benchmarks while maintaining or even improving model performance.

Protein as a Second Language for LLMs

This work treats amino acid sequences as a "second language" for LLMs. By constructing a protein–natural language bilingual dataset and an adaptive context construction mechanism, the proposed framework enables general-purpose LLMs to achieve an average ROUGE-L improvement of 7%—up to 17.2%—on protein question-answering tasks without any training, even surpassing domain-specific fine-tuned models.

Protein Counterfactuals via Diffusion-Guided Latent Optimization

This paper proposes MCCOP, a framework that performs gradient-guided counterfactual optimization in a continuous joint sequence–structure latent space, using a pretrained diffusion model as a manifold prior. With as few as 2–3 mutations, MCCOP generates biologically plausible protein variants that flip predictor outputs, simultaneously enabling model interpretation and protein design hypothesis generation.

Protein Structure Tokenization via Geometric Byte Pair Encoding

This paper proposes GeoBPE — the first tokenizer to extend Byte Pair Encoding (BPE) from discrete text to continuous protein backbone geometry. By alternating between local merging (k-medoids clustering + quantization) and global correction (differentiable inverse kinematics), GeoBPE constructs a hierarchical structural motif vocabulary that achieves >10× compression ratio and >10× data efficiency over VQ-VAE-based protein structure tokenizers (PSTs), ranking first across 24 test sets spanning 12 downstream tasks.

Q-FSRU: Quantum-Augmented Frequency-Spectral Fusion for Medical Visual Question Answering

This paper proposes the Q-FSRU framework, which transforms medical image and text features into the frequency domain via FFT for fusion, and introduces a quantum-inspired retrieval augmentation mechanism (Quantum RAG) to retrieve medical facts from an external knowledge base, achieving 90.0% accuracy on the VQA-RAD dataset.

Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

This paper proposes Resp-Agent, a closed-loop multi-agent framework that coordinates a controllable respiratory sound generator and a multimodal diagnoser via an active adversarial curriculum planner (Thinker-A2CA). Built upon a 229k-scale benchmark, the system achieves co-design of generation and diagnosis, substantially improving diagnostic performance on long-tail categories.

Reverse Distillation: Consistently Scaling Protein Language Model Representations

To address the anomalous scaling phenomenon in protein language models (PLMs), where larger models do not necessarily yield better performance, this paper proposes a reverse distillation framework. It uses the representations of a smaller model as a base, extracts orthogonal residual information from a larger model via SVD, and constructs Matryoshka-style nested embeddings, ensuring that larger reverse-distilled models consistently outperform smaller ones. After reverse distillation, ESM-2 15B becomes the strongest model in its family for the first time.
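A hypothetical sketch of the orthogonal-residual idea (function name and the lstsq-then-SVD recipe are my assumptions): remove the part of the large model's embeddings linearly explained by the small model's, keep the top residual directions, and concatenate so the small model's representation is exactly a prefix of the nested embedding.

```python
import numpy as np

def reverse_distill(small_emb, large_emb, k):
    """Nested embedding construction: (1) regress large-model embeddings
    on small-model embeddings and take the residual, which is orthogonal
    to the small model's representation subspace; (2) keep the top-k SVD
    directions of that residual; (3) concatenate, so truncating the
    result recovers the small model exactly (Matryoshka-style)."""
    coef, *_ = np.linalg.lstsq(small_emb, large_emb, rcond=None)
    residual = large_emb - small_emb @ coef
    U, s, _ = np.linalg.svd(residual, full_matrices=False)
    extra = U[:, :k] * s[:k]
    return np.concatenate([small_emb, extra], axis=1)
```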

Scalable Spatio-Temporal SE(3) Diffusion for Long-Horizon Protein Dynamics

This paper proposes STAR-MD, an SE(3)-equivariant causal diffusion Transformer that achieves microsecond-scale protein dynamics trajectory generation via joint spatio-temporal attention and contextual noise perturbation. STAR-MD attains state-of-the-art performance across all metrics on the ATLAS benchmark and stably extrapolates to microsecond timescales unseen during training.

Scaling with Collapse: Efficient and Predictable Training of LLM Families

This paper demonstrates that the training loss curves (TLCs) of LLM families "collapse" onto a single universal curve when optimization hyperparameters are matched to the data budget, and leverages this phenomenon for two practical applications: (1) deviation from collapse as an early diagnostic signal for training pathologies, and (2) the predictability of the collapse curve enabling early stopping for large-scale hyperparameter tuning.
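
A toy version of the collapse diagnostic, under the assumption that curves are compared on a normalized fraction-of-budget axis (the paper's exact normalization is not reproduced here): resample each loss curve onto t ∈ [0, 1] and report the maximum gap against a reference curve; a large gap flags a pathological run.

```python
import numpy as np

def collapse_deviation(curve_a, curve_b, num_points=50):
    """Max gap between two loss curves on a normalized training axis (sketch)."""
    t = np.linspace(0, 1, num_points)

    def resample(c):
        x = np.linspace(0, 1, len(c))
        return np.interp(t, x, c)

    return np.abs(resample(curve_a) - resample(curve_b)).max()
```

Two runs tracing the same underlying curve at different step counts should show near-zero deviation, while a run decaying at a different rate should not.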

Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People

This paper introduces the Collaborative Battleship task to evaluate the information-seeking capabilities of language models, and proposes three Bayesian inference strategies (Bayes-Q/M/D) to enhance LM questioning, action selection, and decision-making. The approach enables a weak model (Llama-4-Scout) to achieve superhuman performance (82% win rate) at approximately 1% of the cost of GPT-5.

SONIC: Spectral Oriented Neural Invariant Convolutions

SONIC transfers the core idea of state space models to the multi-dimensional frequency domain, defining a set of orientation-selective spectral transfer functions using 6 continuous parameters (amplitude, orientation, damping, oscillation, etc.), and mixing across channels via low-rank matrices \(B\) and \(C\). This yields a drop-in convolutional replacement operator that inherently possesses a global receptive field and resolution invariance. On 3D medical segmentation, it matches nnU-Net with nearly two orders of magnitude fewer parameters, and is also competitive on ImageNet.
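
A loose sketch of one orientation-selective spectral transfer function, evaluated as a damped oscillation along a chosen direction in the 2D frequency plane. The parameter names and functional form here are assumptions for illustration, not SONIC's actual parameterization.

```python
import numpy as np

def sonic_like_filter(shape, amp, theta, damping, freq):
    """One orientation-selective spectral transfer function (sketch)."""
    H, W = shape
    wy = np.fft.fftfreq(H)[:, None]
    wx = np.fft.fftfreq(W)[None, :]
    # Frequency component along the chosen orientation theta
    w = np.cos(theta) * wx + np.sin(theta) * wy
    return amp * np.exp(-damping * np.abs(w)) * np.cos(2 * np.pi * freq * w)

def apply_filter(x, Hf):
    """Global 'convolution' = pointwise multiply in the frequency domain."""
    return np.fft.ifft2(np.fft.fft2(x) * Hf).real

x = np.random.randn(32, 32)
Hf = sonic_like_filter(x.shape, amp=1.0, theta=0.3, damping=4.0, freq=2.0)
y = apply_filter(x, Hf)
```

Because the filter is defined by continuous parameters rather than a fixed kernel grid, the same six numbers can be re-evaluated on any resolution, which is the source of the resolution invariance, and multiplication in the frequency domain gives a global receptive field for free.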

SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis

This paper introduces SurvHTE-Bench, the first comprehensive benchmark for heterogeneous treatment effect (HTE) estimation on right-censored survival data, encompassing 40 synthetic datasets, 10 semi-synthetic datasets, and 2 real-world datasets. It systematically evaluates 53 estimation methods under varying causal assumption violations and censoring levels, finding that no single method dominates, and that survival meta-learners—particularly S-Learner-Survival and Matching-Survival—are most robust under high censoring and assumption violations.

SynCoGen: Synthesizable 3D Molecule Generation via Joint Reaction and Coordinate Modeling

SynCoGen proposes a multimodal generative framework combining masked graph diffusion and flow matching to jointly sample molecular building-block reaction graphs and 3D atomic coordinates, achieving high-quality 3D molecule generation while guaranteeing synthetic feasibility.

Thompson Sampling via Fine-Tuning of LLMs

This paper proposes ToSFiT, which extends Thompson Sampling to large-scale unstructured discrete spaces by fine-tuning large language models to directly parameterize the Probability of Maximality (PoM), thereby circumventing the intractability of acquisition function maximization.
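
For contrast, here is classic Beta-Bernoulli Thompson sampling, where the arm space is small enough to enumerate. ToSFiT's point is to replace this explicit posterior-sample-then-argmax with an LLM fine-tuned to emit candidates in proportion to their probability of being maximal, so no argmax over a huge discrete space is ever needed.

```python
import random

def thompson_bernoulli(true_probs, horizon, seed=0):
    """Classic Beta-Bernoulli Thompson sampling over a small arm set."""
    rng = random.Random(seed)
    n = len(true_probs)
    wins = [1] * n     # Beta(1, 1) priors
    losses = [1] * n
    total = 0
    for _ in range(horizon):
        # Sample one plausible value per arm from its posterior ...
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(n)]
        # ... and play the arm whose sample is maximal
        a = max(range(n), key=lambda i: samples[i])
        r = rng.random() < true_probs[a]
        total += r
        wins[a] += r
        losses[a] += 1 - r
    return total, wins, losses
```

Sampling the argmax of posterior samples is exactly sampling from the Probability of Maximality, which is the distribution ToSFiT trains the LLM to represent directly.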

Tracing Pharmacological Knowledge in Large Language Models

This paper presents the first systematic causal analysis of the encoding mechanisms for drug-group semantics in biomedical LLMs, revealing that drug-group knowledge is stored in early layers and distributed across multiple tokens (not the last token alone), and that linearly separable semantic information is already present at the embedding layer.
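
The linear-separability claim is the kind tested with a linear probe on frozen representations. A minimal sketch (a plain logistic-regression probe, binary labels for simplicity; the paper's probing setup is not reproduced here): `X` would be per-drug embeddings from a given layer and `y` the drug-group labels.

```python
import numpy as np

def linear_probe_accuracy(X, y, epochs=200, lr=0.1):
    """Train a logistic-regression probe on frozen representations."""
    X = np.hstack([X, np.ones((len(X), 1))])   # bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # sigmoid
        w -= lr * X.T @ (p - y) / len(y)       # full-batch gradient step
    return ((X @ w > 0) == (y == 1)).mean()
```

High probe accuracy at the embedding layer, before any transformer block runs, is what supports the claim that the semantic information is linearly decodable from the start.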

Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct

This paper proposes DiDi-Instruct, a distillation framework based on Integrated KL Divergence (IKL) minimization that compresses a pretrained diffusion large language model (dLLM) into a few-step student model. Through four key designs—adversarial density ratio estimation, grouped reward normalization, score decomposition, and a Reward-Guided Ancestral Sampler (RGAS)—the student model surpasses the 1024-step teacher's perplexity on OpenWebText using only 16 steps, achieving up to 64× inference speedup at a training cost of just 1 GPU hour.
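
Of the four designs, grouped reward normalization is simple enough to sketch. Below is a common form of the idea (the paper's exact formulation may differ): each sample's reward is standardized against the mean and standard deviation of its own group, e.g. completions from the same prompt, so the learning signal is relative rather than absolute.

```python
import numpy as np

def grouped_reward_norm(rewards, group_ids):
    """Standardize rewards within groups (sketch)."""
    rewards = np.asarray(rewards, dtype=float)
    group_ids = np.asarray(group_ids)
    out = np.empty_like(rewards)
    for g in np.unique(group_ids):
        m = group_ids == g
        mu, sd = rewards[m].mean(), rewards[m].std() + 1e-8
        out[m] = (rewards[m] - mu) / sd        # zero-mean, unit-scale per group
    return out
```

This removes per-group reward offsets, so a group whose samples are uniformly easy (or hard) does not dominate the gradient.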

Unified Biomolecular Trajectory Generation via Pretrained Variational Bridge

PVB (Pretrained Variational Bridge) unifies the training objectives of single-structure pretraining and paired-trajectory fine-tuning via an encoder-decoder architecture combined with augmented bridge matching, enabling cross-domain biomolecular trajectory generation. It further accelerates protein–ligand holo-state exploration through RL fine-tuning based on adjoint matching.