💬 ACL2026 Paper Notes

681 ACL2026 paper notes covering Multimodal VLM (52), LLM Evaluation (45), LLM Reasoning (45), Model Compression (45), Information Retrieval & RAG (44), LLM Agent (44), Medical Imaging (42), LLM / NLP (38), and 39 other areas. Each note has a TL;DR, motivation, method, experiments, highlights, and limitations: a 5-minute read of the core ideas.

🧩 Multimodal VLM

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends: This paper presents a systematic survey of Multimodal Large Language Model (MLLM)-based Visually Rich Document Understanding (VRDU), organizing OCR-based and OCR-free methods along two dimensions (feature representation/fusion and training paradigms) and discussing open challenges and emerging directions such as data scarcity, multi-page documents, multilingual support, RAG, and agent-based frameworks.
Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization: This paper proposes the GPRO framework, which addresses the overthinking problem in LVLMs by inserting a meta-reasoning controller that dynamically routes computation at each token generation step to one of three paths (fast / perception re-examination / reasoning reflection), improving both accuracy and efficiency.
AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis: This paper proposes AICA-Bench, a comprehensive benchmark that evaluates 23 VLMs along three dimensions: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). The evaluation reveals two systematic deficiencies, intensity calibration failure and shallow description, and the authors introduce a training-free framework, GAT Prompting, to mitigate them.
All Changes May Have Invariant Principles: Improving Ever-Shifting Harmful Meme Detection via Design Concept Reproduction: This paper proposes RepMD, a method that constructs a Design Concept Graph (DCG)—inspired by attack trees to model the steps and logic behind malicious meme creation—to guide MLLMs in detecting ever-shifting harmful memes, achieving 81.1% accuracy on GOAT-Bench.
Automatic Slide Updating with User-Defined Dynamic Templates and Natural Language Instructions: This paper formalizes a novel task of dynamic slide updating on user-defined templates guided by natural language instructions, constructs the DynaSlide benchmark comprising 20,036 instruction-execution triplets, and proposes SlideAgent as a strong reference baseline.
Benchmarking Deflection and Hallucination in Large Vision-Language Models: This paper proposes VLM-DeflectionBench, a multimodal benchmark comprising 2,775 samples that systematically evaluates the deflection vs. hallucination behavior of large vision-language models (LVLMs) under insufficient or misleading evidence, through four evaluation scenarios (Parametric / Oracle / Realistic / Adversarial). Experiments covering 20 state-of-the-art LVLMs reveal that virtually no model can reliably deflect under noisy evidence.
CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity: This paper introduces CArtBench — a multi-task benchmark grounded in the Palace Museum collection — to evaluate VLMs across four capabilities in Chinese art understanding (evidence-anchored QA, structured connoisseurship, defensible reinterpretation, and authenticity verification). Even the strongest models exhibit significant performance drops in evidence association and style-period reasoning, while authenticity verification approaches random-chance performance.
CogGen: A Cognitively Inspired Recursive Framework for Deep Research Report Generation: CogGen proposes a multi-agent recursive framework that simulates the human cognitive writing process. It achieves global restructuring via a macro-cognitive loop, parallel section refinement via a micro-cognitive cycle, and semantic-level text–chart co-planning via Abstract Visual Representation (AVR). On the OWID benchmark, CogGen reaches human expert-level performance and surpasses Gemini Deep Research.
Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games: This paper proposes a collaborative multi-agent framework for automatically generating high-quality murder mystery game scripts and training data. Through a two-stage training strategy (CoT fine-tuning + GRPO reinforcement learning with ScoreAgent reward shaping), it enhances VLM multi-hop reasoning under imperfect information, achieving significant improvements on WhodunitBench in narrative reasoning, fact extraction, and deception resistance.
Doc-PP: Document Policy Preservation Benchmark for Large Vision-Language Models: This paper proposes the Doc-PP benchmark, revealing a "reasoning-induced safety gap" in large vision-language models (LVLMs) during multimodal document question answering—models bypass explicit non-disclosure policies and leak sensitive information when cross-modal reasoning is required. A structured reasoning framework, DVA (Decompose–Verify–Aggregation), is proposed to substantially reduce leakage rates.
Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction: This paper proposes VeriGUI, a framework that incorporates a Thinking-Verification-Action-Expectation (TVAE) closed-loop reasoning mechanism and a two-stage training pipeline (Robust SFT + GRPO), enabling GUI agents to verify whether each action succeeds and self-correct upon failure. VeriGUI achieves substantial improvements over baselines at both 3B and 7B scales.
Dynamic Emotion and Personality Profiling for Multimodal Deception Detection: This paper identifies that existing deception detection datasets provide only participant-level emotion/personality labels (all samples from the same subject share identical labels), and proposes a sample-level dynamic annotation scheme along with a reliability-weighted multimodal fusion framework, Rel-DDEP, achieving improvements of 2.53% in deception detection F1, 2.66% in emotion detection F1, and 9.30% in personality detection F1.
Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects: This paper proposes a systematic taxonomy for efficient inference in large vision-language models (LVLMs), analyzing bottlenecks across an encode–prefill–decode three-stage pipeline. It identifies a systemic efficiency barrier caused by visual token dominance and presents a comprehensive map of optimization techniques spanning information density shaping, long-context attention management, and memory bandwidth breakthroughs.
Enhancing Multimodal Large Language Models for Ancient Chinese Character Evolution Analysis via Glyph-Driven Fine-Tuning: This paper constructs a benchmark for ancient Chinese character evolution analysis comprising 11 tasks and 130,000+ instances, evaluates 19 MLLMs to reveal their limited capacity for glyph-level recognition and evolution reasoning, and proposes GEVO—a glyph-driven contrastive fine-tuning framework—that achieves consistent improvements across all tasks on a 2B-scale model.
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection: This paper formally defines the multimodal error detection task and constructs the ErrorRadar benchmark — comprising 2,500 K-12 multimodal math problems drawn from real student responses — to evaluate MLLMs on two subtasks: error step identification (STEP) and error type classification (CATE). The strongest model, GPT-4o, still lags behind human evaluators by approximately 10–15%.
Faithful-First Reasoning, Planning, and Acting for Multimodal LLMs: This paper proposes the Faithful-First RPA framework, which employs the FaithEvi pipeline to evaluate perceptual faithfulness at each reasoning step (i.e., whether claimed objects genuinely exist in the image), and the FaithAct mechanism to enforce evidence-grounded planning and acting during reasoning generation. The framework improves perceptual faithfulness by up to 24% without degrading task accuracy.
FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models: FineSteer decomposes inference-time steering into two complementary stages: Subspace-guided Conditional Steering (SCS) determines when to steer — using the subspace energy ratio of IR queries as a gate; Mixture of Steering Experts (MoSE) determines how to steer — dynamically aggregating prototype experts via an attention gating network with residual refinement to produce query-specific steering vectors. The framework surpasses SOTA on both safety and truthfulness benchmarks.
From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models: This paper proposes HONES, a framework that first localizes task-critical attention heads and then uses them as conditions to guide FFN neuron attribution, achieving unified, gradient-free, neuron-level causal analysis and lightweight task performance improvement across heterogeneous tasks in multi-task VLMs.
From Inheritance to Saturation: Disentangling the Evolution of Visual Redundancy for Architecture-Aware MLLM Inference Acceleration: This work identifies two sources of visual redundancy in MLLM inference — inherent visual redundancy (IVR) arising from dense ViT tokenization, and secondary saturation redundancy (SSR) emerging from deep-layer semantic saturation whose manifestation varies across backbone architectures — and proposes the HalfV framework to address each type separately, achieving a 4.1× FLOPs speedup on Qwen2.5-VL while retaining 96.8% of performance.
From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck: This paper proposes MM-Mem, a pyramidal multimodal memory architecture inspired by Fuzzy Trace Theory (FTT). The memory is organized into three hierarchical layers — a Sensory Buffer (vision-dominant), an Episodic Stream (event-level summaries), and a Symbolic Schema (knowledge graph) — and achieves SOTA performance on 4 long-video benchmarks by compressing redundancy bottom-up via SIB-GRPO (Semantic Information Bottleneck + RL) and retrieving top-down via entropy-driven adaptive depth selection.
GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models: This paper proposes GAMBIT, a gamified multimodal jailbreak framework that decomposes harmful queries into puzzle images with hidden keywords and embeds them within competitive game scenarios. By exploiting the model's reasoning incentives and cognitive load, GAMBIT bypasses safety filters, achieving attack success rates of 92.13% on Gemini 2.5 Flash and 85.87% on GPT-4o, and is effective against both reasoning and non-reasoning models.
GeoRC: A Benchmark for Geolocation Reasoning Chains: This paper introduces GeoRC, the first geolocation reasoning chain benchmark authored by GeoGuessr champion-level experts (800 reasoning chains, 500 scenes), designed to evaluate VLMs' ability to generate auditable reasoning chains. Findings reveal that while closed-source VLMs can match human-level localization accuracy, their reasoning chain quality remains substantially inferior, and open-source VLMs perform nearly on par with a pure hallucination baseline.
HiPrune: Hierarchical Attention for Efficient Token Pruning in Vision-Language Models: This paper identifies a hierarchical attention pattern in vision encoders—middle layers attend to foreground objects while deep layers capture global information—and proposes HiPrune, a training-free, model-agnostic visual token pruning method. By selecting three categories of tokens (Anchor/Buffer/Register) to preserve information at different semantic levels, HiPrune retains 99.3% of performance using only 1/3 of the tokens while reducing FLOPs by 58.7%.
LaMI: Augmenting Large Language Models via Late Multi-Image Fusion: This paper proposes LaMI, a late-fusion architecture that integrates visual features with LLM outputs at the final prediction stage, and at inference time generates multiple images from input text for confidence-weighted aggregation. LaMI significantly enhances the visual commonsense reasoning capability of LLMs without compromising their text reasoning performance.
Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Unauthorized Images: This paper proposes ImageProtector, which embeds nearly imperceptible adversarial perturbations into images as a visual prompt injection attack, causing MLLMs to generate refusal responses when analyzing protected images. This prevents malicious actors from exploiting open-weight MLLMs to extract private information from images at scale.
Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation: This paper exposes the threat of Adversarial Smuggling Attacks (ASA) in MLLM-based content moderation: harmful content is encoded into formats that remain visually readable to humans but are imperceptible to AI models, evading automated detection. The authors construct SmuggleBench, a benchmark of 1,700 samples spanning 9 attack techniques, and demonstrate that attack success rates exceed 90% even against SOTA models including GPT-5.
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems: This work proposes the FlowVerse benchmark—which decomposes mathematical problem information into four components (DI/EI/RP/OQ) and constructs six variant versions—and the MathFlow modular pipeline, which decouples perception and reasoning into independent stages. A dedicated perception model, MathFlow-P-7B, is trained to extract critical information from mathematical diagrams, substantially improving visual mathematical problem-solving performance across diverse reasoning models.
MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models: This paper presents MedLayBench-V, the first large-scale multimodal medical expert-lay semantic alignment benchmark (79,793 image-text pairs). Through a Structured Concept-Grounded Refinement (SCGR) pipeline, professional radiology reports are transformed into lay descriptions, reducing reading difficulty from graduate level to high school level while preserving clinical semantic fidelity. Zero-shot retrieval experiments demonstrate that lay descriptions incur less than 1% performance degradation.
Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation: This paper proposes the MPD framework, which decouples hallucination components via semantics-aware orthogonal subspace projection and selectively updates only the parameters most relevant to hallucinations. MPD reduces hallucinations by 23.4% while preserving 97.4% of general generation capability, without introducing any additional inference overhead.
MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models: This paper presents MMErroR, a multimodal erroneous reasoning benchmark comprising 1,997 samples, each containing exactly one deliberately injected reasoning error, spanning 6 domains and 4 error types. The benchmark requires VLMs to not only detect the presence of errors in reasoning chains but also classify the error type (Visual Perception Error / Knowledge Deficiency Error / Question Comprehension Error / Reasoning Error). Evaluation of 12 representative VLMs reveals that even the strongest model, Gemini-3-Pro-Preview, achieves only 66.65% accuracy.
More Than Meets the Eye: Measuring the Semiotic Gap in Vision-Language Models via Semantic Anchorage: This paper exposes the "literal superiority bias" in VLMs from a cognitive-semiotic perspective—models tend toward literal rather than metaphorical/idiomatic interpretations of high-fidelity images. By introducing the DIVA benchmark (iconographically abstracted images) and the Semantic Alignment Gap metric, the paper demonstrates that reducing visual fidelity significantly narrows the gap between literal and idiomatic interpretations.
Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge: This paper proposes MT-RL-Judge, a multi-task reinforcement learning framework that jointly optimizes multiple evaluation tasks via GRPO to train a unified MLLM-as-a-Judge model. The framework consistently outperforms SFT baselines across six benchmarks covering text-image alignment, safety compliance, and visual quality assessment, and demonstrates strong out-of-distribution generalization on the unseen MJ-Bench pairwise comparison format (82.23% on Safety vs. 49.40% for SFT-Unified).
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models: This paper presents OMIBench — the first large-scale benchmark for olympiad-level multi-image reasoning, covering 1,000+ competition problems across biology, chemistry, mathematics, and physics. Even the strongest LVLM (Gemini-3-Pro) achieves only ~50% accuracy, a drop of more than 25% compared to single-image benchmarks.
Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval: This paper proposes OEA (Omni-Embed-Audio), which employs a multimodal LLM as a unified encoder to construct a retrieval-oriented audio-text embedding space. It introduces the User-Intent Queries (UIQ) benchmark and hard-negative discrimination metrics (HNSR/TFR), demonstrating that the LLM backbone significantly outperforms CLAP-based methods on T2T retrieval (+22%) and hard negative discrimination (+4.3 percentage points HNSR@10).
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning: This position paper argues that Multimodal Large Language Models (MLLMs) can significantly advance cross-disciplinary scientific reasoning. It proposes a four-stage research roadmap (broad knowledge recognition → analogical reasoning & generalization → insightful reasoning → creative hypothesis generation), and systematically surveys the current state of MLLM applications across mathematics, physics, chemistry, and biology, identifying five major challenges and eight future directions.
Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring: This paper proposes the Representational Contrastive Scoring (RCS) framework, which analyzes the geometric structure of intermediate-layer representations within LVLMs. By learning a lightweight projection and applying contrastive scoring, RCS distinguishes malicious intent from benign distributional shift, achieving state-of-the-art jailbreak detection performance under a rigorous cross-attack-type generalization evaluation protocol.
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models: This paper proposes the SafetyALFRED benchmark, which introduces six categories of kitchen safety hazards into the ALFRED embodied task setting. It reveals a critical alignment gap: multimodal large language models can identify hazards in static QA (up to 92%) but fail to proactively mitigate them during embodied planning (<60%), advocating a paradigm shift from QA-based to embodied safety evaluation.
Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking: This paper proposes Attention-Guided Visual Jailbreaking, which bypasses—rather than directly confronts—safety alignment mechanisms by suppressing model attention to safety instructions and anchoring attention to adversarial image features. The method achieves a 94.4% attack success rate (ASR) on Qwen-VL while reducing gradient conflicts by 45%.
Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation: This paper proposes DaID (Dual-Anchor Introspective Decoding), which mitigates hallucinations within a single forward pass by exploiting layer-wise differences in visual perception within MLLMs — Spotlight layers amplify visual signals while Shadow layers suppress language priors.
Targeted Exploration via Unified Entropy Control for Reinforcement Learning: This paper proposes UEC-RL, a unified bidirectional entropy control framework that addresses entropy collapse and training instability in GRPO by performing high-temperature targeted exploration on difficult prompts (entropy increase) and consolidating high-quality trajectories via an experience replay stabilizer (entropy decrease), achieving a 37.9% relative improvement on Geometry3K.
TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval: This paper proposes TEMA (Text-oriented Entity Mapping Architecture), the first CIR framework designed for multi-modification text (MMT). It enhances entity coverage via a Parsing Assistant (PA), resolves clause-entity misalignment via an Entity Mapping (EM) module, and introduces two multi-modification benchmarks—M-FashionIQ and M-CIRR—achieving state-of-the-art performance in both standard and multi-modification settings.
Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding: This paper proposes the Perception Magnifier (PM), a visual decoding method that, at each autoregressive decoding step, iteratively identifies critical visual regions based on multi-layer attention and adaptively magnifies them. By increasing the effective resolution of key regions, PM mitigates visual hallucinations in VLMs while preserving spatial structure and reasoning capability.
Topology-Aware Layer Pruning for Large Vision-Language Models: This paper proposes TopoVLM, a topology-aware layer pruning framework that models hidden states at each layer as point clouds and quantifies inter-layer topological consistency via zigzag persistent homology. The method adaptively retains critical representation-transition layers while removing structurally redundant ones, achieving significant improvements over existing pruning methods at 50–60% sparsity.
Tree-of-Evidence: Efficient "System 2" Search for Faithful Multimodal Grounding: This paper proposes Tree-of-Evidence (ToE), an inference-time discrete beam search algorithm that formalizes multimodal model interpretability as a discrete optimization problem over coarse-grained evidence units (vital sign time windows, radiology report segments). With only 5 evidence units, ToE retains over 98% of the full-input model's AUROC while generating auditable evidence trace paths.
TRACE: Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning: This paper proposes TRACE (Textual Representation of Allocentric Context from Egocentric Video), a prompting method that guides multimodal large language models to generate structured textual allocentric 3D environment representations from egocentric video—comprising meta context, camera trajectory, and an entity registry—as intermediate reasoning steps to enhance spatial question answering. TRACE consistently outperforms existing prompting strategies on both VSI-Bench and OST-Bench.
VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models: This paper proposes VLA-Forget, the first hybrid unlearning framework for vision-language-action (VLA) models. It employs ratio-aware selective editing for perception/cross-modal layers and significance-based selective editing for reasoning/action layers, achieving targeted behavior removal while improving perceptual specificity (+22%) and task success rate (+9%).
What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning: This paper proposes UILoop (UI-in-the-Loop), a paradigm that restructures GUI reasoning from the conventional "screen→action" pipeline into a cyclic "screen→UI elements→action" process. Through UI element-driven reinforcement fine-tuning, the model is trained to explicitly locate, understand, and leverage key UI elements, achieving state-of-the-art performance on GUI reasoning benchmarks.
What Do Vision-Language Models Encode for Personalized Image Aesthetics Assessment?: Through linear probing, this paper demonstrates that VLM hidden representations encode rich, multi-level aesthetic attribute information (illumination, color, composition, etc.) that propagates into language decoder layers. Building on this finding, the authors propose a simple linear regression approach for personalized image aesthetics assessment (PIAA) that requires no fine-tuning, significantly outperforming few-shot and LoRA fine-tuning baselines.
When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life: This paper introduces SaLAD, a benchmark comprising 2,013 real image-text samples spanning 10 daily-life categories, designed to evaluate the ability of multimodal large language models to identify implicit safety risks and provide safety warnings during everyday assistance. Results reveal that even the strongest model achieves only 57.2% accuracy on unsafe queries.
When Slower Isn't Truer: Inverse Scaling Law of Truthfulness in Multimodal Reasoning: This paper identifies an "inverse scaling law" in multimodal reasoning models — slow-thinking (reasoning) models are more prone to producing untruthful outputs than fast-thinking (chat) models when faced with misleading visual inputs. To systematically diagnose this phenomenon, the authors construct the TruthfulVQA benchmark (5,000+ samples, 50 annotators, three-tier graded prompts) and the TruthfulJudge evaluation model (88.4% accuracy).
When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias: This paper exposes a severe informativeness bias in VLM-as-a-Judge systems—judges tend to favor more detailed and elaborate responses even when such responses contradict the image content. The proposed BIRCH paradigm first calibrates candidate answers against the image before comparison, reducing bias by up to 17% and improving performance by up to 9.8%.
WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering: This paper proposes WikiSeeker, which redefines the role of VLMs in multimodal RAG—transforming them from mere answer generators into two specialized agents: a Refiner (trained with RL to rewrite queries) and an Inspector (to verify the reliability of retrieved contexts). WikiSeeker achieves state-of-the-art performance on three benchmarks: EVQA, InfoSeek, and M2KR.

📊 LLM Evaluation

Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL: Abstain-R1 proposes a clarification-aware RLVR reward that jointly optimizes explicit abstention and post-refusal clarification (identifying missing information) on unanswerable queries, enabling a 3B model to match or surpass large models such as DeepSeek-R1 on both abstention and clarification quality.
AnchorMem: Anchored Facts with Associative Contexts for Building Memory in Large Language Models: This paper proposes AnchorMem, a memory framework inspired by the Proustian phenomenon in cognitive science. It decouples retrieval units (atomic facts) from generation contexts (original interactions) and connects fragmented memories via an associative event graph, achieving substantial improvements over existing memory systems such as A-Mem and Mem0 on the LoCoMo benchmark.
Are They Lovers or Friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues: This paper introduces SCRIPTS, a benchmark comprising 1.1K English and Korean movie dialogues, evaluating the social relationship reasoning capabilities of 9 LLMs via a three-tier probabilistic labeling scheme (HIGHLY LIKELY / LESS LIKELY / UNLIKELY). Results show that models achieve only 75–80% accuracy on English and 58–69% on Korean, with Chain-of-Thought prompting and reasoning models providing little to no benefit for social reasoning.
Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models: This paper presents a systematic survey of 134 papers on evidence-based text generation with LLMs, proposing for the first time a unified taxonomy (attribution approach × citation characteristics × task), analyzing 300 evaluation metrics organized into seven dimensions and six method categories, and providing a panoramic reference framework for this fragmented field.
AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage: AutoReproduce proposes a multi-agent framework that mines implicit domain knowledge from cited references via a "Paper Lineage" algorithm, enabling end-to-end automatic reproduction of paper experiments. On the self-constructed benchmark ReproduceBench, it achieves a code execution rate of 94.87% with a performance gap of only 19.72%.
Beyond Reproduction: A Paired-Task Framework for Assessing LLM Comprehension and Creativity in Literary Translation: This paper proposes a paired-task framework for jointly evaluating LLMs' literary text comprehension and translational creativity, conducting a large-scale benchmark of 23 models across 11 classic English novels, and finding that strong comprehension ability does not transfer to human-level translational creativity.
BizCompass: Benchmarking the Reasoning Capabilities of LLMs in Business Knowledge and Applications: This paper introduces BizCompass, a business reasoning benchmark bridging theoretical foundations and practical applications. It covers four knowledge domains (finance, economics, statistics, and operations management) and three application roles (analyst, trader, and consultant), systematically evaluating the business reasoning capabilities of both open-source and closed-source LLMs, and revealing how theoretical knowledge transfers to real-world performance.
Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry: This paper proposes a three-step evaluation framework (computational feature extraction + LLM-as-Judge + human expert validation) to systematically assess the Tang poetry generation capabilities of six LLMs. A critical "echo chamber" effect is identified: LLMs systematically overrate machine-generated poems that mimic statistical patterns while violating prosodic rules, diverging significantly from human expert judgments.
CAST: Achieving Stable LLM-based Text Analysis for Data Analytics: This paper proposes the CAST framework, which constrains the latent reasoning trajectories of LLMs through two complementary mechanisms—Algorithmic Prompting and Thinking-before-Speaking—to significantly improve run-to-run stability in text summarization and annotation tasks without sacrificing output quality.
Closing the Modality Reasoning Gap for Speech Large Language Models: This paper proposes TARS (Trajectory Alignment for Reasoning in Speech), a reinforcement learning-based framework that aligns speech-conditioned reasoning trajectories with text-conditioned ones via two dense reward signals—representation alignment and behavior alignment. TARS achieves state-of-the-art performance at the 7B scale, with a Modality Recovery Rate (MRR) approaching or exceeding 100%.
Common to Whom? Regional Cultural Commonsense and LLM Bias in India: This paper introduces Indica, the first benchmark for evaluating LLM performance on sub-national cultural commonsense, focusing on cultural differences across five regions of India in eight domains of everyday life. Only 39.4% of questions reach consensus across all five regions, and all evaluated LLMs exhibit geographic bias—systematically over-selecting Central and North India as the "default" cultural representative.
Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge: This paper identifies score range bias in LLM judges under direct assessment settings: model outputs are highly sensitive to the predefined score range. It proposes contrastive decoding as a mitigation strategy, which leverages the mutual cancellation of similar biases within the same model family and achieves an average relative improvement of up to 11.3% in Spearman correlation.
DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition: DiZiNER simulates the pilot annotation workflow in human labeling pipelines by employing multiple heterogeneous LLMs as annotators and a supervisor LLM to analyze inter-model disagreements and iteratively refine task instructions. The method achieves zero-shot state-of-the-art on 14 out of 18 NER benchmarks, with an average improvement of +8.0 F1, and surpasses its own supervisor model, GPT-4o mini, without any parameter updates.
Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff: This paper presents LLMThinkBench, a benchmark for systematically evaluating the efficiency of LLMs on basic mathematical reasoning. It introduces the Overthinking Score — a harmonic mean of accuracy and token efficiency — and evaluates 53 LLMs across 14 deterministically generated math tasks. Results show that reasoning models generate on average ~18× more tokens yet sometimes achieve lower accuracy, and that scaling inference budgets yields diminishing returns.
E2EDev: Benchmarking Large Language Models in End-to-End Software Development Task: This paper proposes E2EDev, an end-to-end software development benchmark grounded in Behavior-Driven Development (BDD) principles. It comprises 46 real-world web projects, 244 fine-grained requirements, and 703 executable BDD tests. Evaluation reveals that even the strongest LLMs (Claude series) achieve no more than 60% requirement accuracy, and that the interaction overhead of multi-agent frameworks is disproportionate to their performance gains.
Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks: L2T proposes a pre-training framework that mixes 14 language learning tasks spanning four linguistic granularities (character → discourse) with standard next-token prediction. At the 500M/1B parameter scale, it improves BLiMP linguistic competence scores by 2–3 percentage points and accelerates their acquisition, while preserving general reasoning performance.
Exploring the Capability Boundaries of LLMs in Mastering of Chinese Chouxiang Language: This paper introduces the Chinese internet subculture language "Chouxiang" (抽象话) to the NLP community and builds Mouse, the first evaluation benchmark for it, with six tasks: translation (TR), representation classification (RC), intent recognition (IR), toxicity detection (TD), meaning selection (MS), and cloze completion (CC). State-of-the-art LLMs perform reasonably well on contextual semantic understanding but exhibit significant limitations on the other tasks.
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows: This paper introduces Finch (FinWorkBench), a finance and accounting workflow benchmark constructed from real enterprise environments (e.g., the Enron dataset), comprising 172 composite workflows and 1,710 spreadsheets (27 million cells). Even the strongest model, GPT 5.1 Pro, spending an average of 16.8 minutes per workflow, passes only 38.4% of the workflows, revealing critical gaps in frontier AI agents under realistic enterprise conditions.
From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning: This paper formally defines two granularities of LLM unlearning—domain-level and instance-level—and proposes the BiForget framework. Rather than relying on external strong models, BiForget leverages the target model itself to construct high-quality forget datasets via two stages: seed-guided synthesis and adversarial probing. On the Harry Potter domain, it improves relevance by ~20 and diversity by ~0.05 while halving the data volume.
HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents: This paper proposes HiGMem, a two-level event-turn memory system that enables an LLM to first browse event summaries and then predict which fine-grained conversation turns are worth reading, achieving the best F1 on four out of five question categories on the LoCoMo10 benchmark while retrieving an order of magnitude fewer turns.
Idiom Understanding as a Tool to Measure the Dialect Gap: Three new French idiom understanding benchmark datasets are proposed — QFrCoRE and QFrCoRT for Quebec French, and MFrCoE for standard French. Evaluation across 111 LLMs reveals that 65.77% of models perform significantly worse on dialectal idioms than on standard French idioms, quantifying the dialect gap phenomenon.
Language Model as Planner and Formalizer under Constraints: This paper introduces the CoPE benchmark, which injects formally categorized natural language constraints into classical planning environments, revealing that a single constraint sentence can halve the planning performance of state-of-the-art LLMs, exposing critical deficiencies in LLM planning robustness.
LexRel: Benchmarking Legal Relation Extraction for Chinese Civil Cases: This work introduces the first structured taxonomy of legal relations in Chinese civil law (9 domains, 265 relation types) and presents LexRel, a benchmark comprising 1,140 expert-annotated instances. The benchmark is used to evaluate leading LLMs on legal relation extraction, revealing significant limitations of current models on this task, while also demonstrating that incorporating legal relation information yields consistent gains on downstream legal AI tasks.
MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification: This paper introduces MADE—a "living" multi-label text classification benchmark built on FDA medical device adverse event reports, featuring 1,154 hierarchical labels and strict temporal splits. It systematically evaluates 20+ encoder/decoder models across discriminative fine-tuning, generative fine-tuning, and few-shot prompting paradigms, assessing both predictive performance and uncertainty quantification (UQ) capabilities. Key findings reveal critical trade-offs: small discriminatively fine-tuned decoders achieve the best head-to-tail accuracy; generative fine-tuning yields the most reliable UQ; and large reasoning models improve rare-label performance but exhibit surprisingly weak UQ.
Min-k Sampling: Decoupling Truncation from Temperature Scaling via Relative Logit Dynamics: Min-k Sampling detects "semantic cliffs" — the boundary between high-confidence candidate tokens and low-quality tail noise — by analyzing the local structure of the sorted logit distribution. This yields strictly temperature-invariant truncation that maintains robust performance on reasoning and creative writing tasks even under extreme temperatures.
Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates: This paper proposes Source-Shielded Updates (SSU), a column-wise freezing strategy driven by source-data parameter importance scoring. During continual pre-training (CPT) using only unlabeled target-language data, SSU reduces source-language performance degradation from 20.3% (full fine-tuning) to 3.4%, while maintaining target-language performance on par with or superior to full fine-tuning.
Modeling Multi-Dimensional Cognitive States in Large Language Models under Cognitive Crowding: This paper identifies that LLMs suffer a dramatic accuracy drop to 5.7% when jointly predicting four cognitive dimensions—sentiment, thinking style, stance, and intent—a phenomenon termed "cognitive crowding." Through Gromov \(\delta\)-hyperbolicity analysis, the paper demonstrates that cognitive states exhibit hierarchical structure, and proposes HyCoLLM, a framework that models cognitive states in hyperbolic space. An 8B model trained under this framework surpasses GPT-4o.
MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models: This paper proposes MTR-DuplexBench, a comprehensive multi-round evaluation benchmark for full-duplex speech language models (FD-SLMs). By introducing a novel turn segmentation method, it addresses the challenges of ambiguous turn boundaries and context inconsistency inherent in full-duplex dialogue. The benchmark covers four dimensions: conversational characteristics, dialogue quality, instruction following, and safety. Experiments reveal a consistent performance degradation of existing FD-SLMs across multi-round interactions.
MultiFileTest: A Multi-File-Level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms: This paper introduces MultiFileTest, the first multi-file-level benchmark for LLM-based unit test generation, covering 20 projects each in Python, Java, and JavaScript. It evaluates 11 state-of-the-art LLMs and analyzes the impact of manual and self-repair mechanisms on test quality, revealing that even the strongest models produce substantial basic executability errors.
ODUTQA-MDC: A Task for Open-Domain Underspecified Tabular QA with Multi-turn Dialogue-based Clarification: This paper introduces the ODUTQA-MDC task and benchmark, the first systematic study of underspecified query detection and multi-turn dialogue-based clarification in open-domain tabular QA. The authors construct a large-scale dataset of 25,105 QA pairs and propose the MAIC-TQA multi-agent framework to perform end-to-end "detect–clarify–reason" tabular question answering.
PIArena: A Platform for Prompt Injection Evaluation: This paper presents PIArena, a unified and extensible evaluation platform for prompt injection (PI), integrating multiple state-of-the-art attack and defense methods with plug-and-play evaluation support. It introduces a strategy-based adaptive attack method and systematically exposes critical limitations of existing defenses in terms of generalization, resilience to adaptive attacks, and task-aligned injection scenarios.
ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition: This paper proposes ResearchBench, the first large-scale benchmark for evaluating LLMs in scientific discovery. Grounded in a theoretically motivated decomposition of inspiration-driven hypothesis generation, it covers 1,386 papers across 12 disciplines and decomposes scientific discovery into three sufficient subtasks: inspiration retrieval, hypothesis composition, and hypothesis ranking. Results show that LLMs perform surprisingly well on cross-disciplinary inspiration retrieval.
Rethinking Meeting Effectiveness: A Benchmark and Framework for Temporal Fine-grained Automatic Meeting Effectiveness Evaluation: This paper redefines meeting effectiveness evaluation by proposing an objective criterion of "goal achievement / time cost" and a temporal fine-grained evaluation paradigm. It constructs the AMI-ME dataset comprising 2,459 annotated segments from 130 meetings, and develops an LLM-based automatic evaluation framework achieving a Spearman correlation of 0.64.
ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering: This paper introduces ReTraceQA, the first reasoning process evaluation benchmark for commonsense question answering, comprising 2,421 instances annotated by domain experts with step-level error localization and error categorization. The benchmark reveals that 14–24% of SLMs produce correct answers via flawed reasoning, and that replacing answer-only evaluation with reasoning-aware evaluation reduces SLM performance by up to 25 percentage points.
Revisiting the Uniform Information Density Hypothesis in LLM Reasoning: This paper introduces the Uniform Information Density (UID) hypothesis from psycholinguistics into the analysis of LLM reasoning. It proposes an entropy-based, step-level information density measurement framework, revealing a counterintuitive pattern in high-quality reasoning trajectories characterized by local uniformity combined with global non-uniformity, and demonstrates that this pattern significantly outperforms conventional confidence/entropy baselines in Best-of-N sampling.
RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity: RoleConflictBench constructs 13,914 role conflict scenarios and leverages situational urgency as an objective constraint to evaluate LLMs' contextual sensitivity, revealing that model decisions are dominated by static role preferences rather than dynamic contextual cues.
SciImpact: A Multi-Dimensional, Multi-Field Benchmark for Scientific Impact Prediction: This paper introduces SciImpact — the first large-scale scientific impact prediction benchmark spanning 19 disciplines and 7 impact dimensions (citations, awards, patents, media, code, datasets, and models), comprising 215,928 contrastive paper pairs. Multi-task fine-tuning enables a 4B model to outperform large models such as o4-mini.
Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness: This paper proposes SABA, a reasoning framework that adopts a "perceive before act" paradigm, explicitly constructing and auditing knowledge states prior to any final decision. It employs Information Fusion (IF) to consolidate narratives into a verifiable baseline state, and Query-driven Structured Reasoning (QSR) to recursively identify and resolve missing premises, achieving state-of-the-art performance on both detective reasoning and general reasoning benchmarks.
SessionIntentBench: A Multi-Task Inter-Session Intention-Shift Modeling Benchmark: This paper proposes SessionIntentBench, a multi-task benchmark for evaluating the ability of L(V)LMs to understand inter-session intention shifts in e-commerce shopping sessions. It comprises four progressively structured subtasks—intent-purchase likelihood estimation, attribute normalization, intent verification contrast, and intent evolution modeling—constructed from 1.9 million intent entries and 1.13 million intent trajectories. Experiments on 20+ L(V)LMs demonstrate that current models perform poorly at capturing complex session-level user intent.
Subject-level Inference for Realistic Text Anonymization Evaluation: SPIA introduces the first subject-level PII inference evaluation benchmark (675 documents, 1,712 subjects, 7,040 PII instances), revealing that even when 90%+ of PII spans are redacted, the subject-level inference protection rate can be as low as 33%, and that anonymization focused on a single target subject leads to greater exposure of non-target subjects.
Task-Aware LLM Routing with Multi-Level Task-Profile-Guided Data Synthesis for Cold-Start Scenarios: This paper proposes a multi-level task-profile-guided data synthesis framework to address the cold-start problem in LLM routing. It also introduces TRouter, a routing method that treats task type as a latent variable and models the query-cost-performance relationship via variational inference, achieving effective routing in both cold-start and in-domain settings.
Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context: This paper proposes Quantile Token Regression, a method that inserts dedicated quantile tokens into the input sequence and incorporates retrieved neighbor instances along with their empirical distributions, enabling LLMs to predict full conditional distributions rather than single point estimates. The approach reduces MAPE by approximately 4 points over baselines and narrows prediction intervals by more than 2× on the Airbnb and StackSample datasets.
Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation: This paper proposes SignThought, a reasoning-driven gloss-free sign language translation framework that introduces learnable latent thought slots as an explicit intermediate semantic layer between video and text. A "plan-then-locate" dual-stream decoder decouples semantic planning from visual evidence retrieval, achieving state-of-the-art performance among gloss-free methods on multiple benchmarks.
TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale: TingIS is an end-to-end risk event discovery system deployed on a fintech platform. It employs a five-module architecture—semantic distillation, cascaded routing, event linking engine, state management, and multi-dimensional denoising—to extract actionable risk events from massive noisy customer complaints in real time, achieving a P90 alert latency of 3.5 minutes and a 95% high-priority event discovery rate.
Towards Self-Improving Error Diagnosis in Multi-Agent Systems: This paper proposes ErrorProbe, a framework that achieves self-improving semantic fault attribution in multi-agent systems through MAST taxonomy-driven structured decomposition, symptom-driven backward tracing, and a verified memory mechanism. The approach substantially outperforms baselines, particularly in step-level error localization.
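
The LLMThinkBench entry above describes its Overthinking Score as a harmonic mean of accuracy and token efficiency. The note does not say how token efficiency is computed, so the sketch below assumes a simple budget-based definition (one minus the fraction of a token budget consumed). The function names and the `token_budget` parameter are hypothetical and are not taken from the paper.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative scores; returns 0.0 if either is 0."""
    if a <= 0 or b <= 0:
        return 0.0
    return 2 * a * b / (a + b)


def token_efficiency(tokens_used: int, token_budget: int) -> float:
    """Assumed efficiency in [0, 1]: using fewer tokens of the budget scores higher.

    This definition is an illustration only; the benchmark may normalize differently.
    """
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    return max(0.0, 1.0 - tokens_used / token_budget)


def overthinking_score(accuracy: float, tokens_used: int, token_budget: int) -> float:
    """Combine accuracy and the assumed token efficiency with a harmonic mean."""
    return harmonic_mean(accuracy, token_efficiency(tokens_used, token_budget))


if __name__ == "__main__":
    # A concise model and a verbose model with the same accuracy.
    print(overthinking_score(0.9, tokens_used=200, token_budget=1000))
    print(overthinking_score(0.9, tokens_used=900, token_budget=1000))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model that is accurate but spends nearly its whole budget scores low in this sketch, which matches the accuracy-efficiency trade-off the entry describes.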

💡 LLM Reasoning¶

AIM-CoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning: This paper proposes AIM-CoT, a framework driven by Information Foraging Theory that addresses two core problems in Interleaved Multimodal Chain-of-Thought (I-MCoT)—what to see and when to see—through Active Visual Probing (AVP) based on information gain and a Dynamic Attention-shift Trigger (DAT) mechanism.
Budget-Aware Anytime Reasoning with LLM-Synthesized Preference Data: This paper proposes a budget-aware anytime reasoning framework and an Anytime Index metric to quantify the quality-efficiency trade-off of LLM reasoning under limited token budgets. It further introduces Preference Data Prompting (PDP), a test-time self-improvement method based on LLM-synthesized preference data, achieving substantial improvements in both intermediate and final solution quality across planning, mathematics, and science QA tasks.
Chain-of-Thought as a Lens: Evaluating Structured Reasoning Alignment between Human Preferences and Large Language Models: This paper proposes the Alignment Score — a semantic-level metric based on pairwise semantic entropy matrices — that quantifies reasoning alignment by comparing intermediate steps of model-generated chains-of-thought against human-preferred reference chains. The authors find that Alignment Score correlates strongly with task accuracy, readability, and coherence, and that 2-hop reasoning represents the peak depth for alignment.
Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models: This paper introduces OlymMATH, the first olympiad-level mathematical benchmark that unifies natural language evaluation and formal theorem proving. It comprises 350 bilingual (Chinese–English) problems, spanning OlymMATH-EASY/HARD (200 problems with numerical answers) and OlymMATH-LEAN (150 problems formalized in Lean 4). Experiments reveal that the strongest model achieves only 58.4% accuracy on the HARD subset.
CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization: To address the unlearning challenge in large reasoning models (LRMs)—where sensitive knowledge must be removed from both chain-of-thought (CoT) reasoning and final answers simultaneously—this paper proposes the CiPO framework. CiPO instructs the model to generate logically valid counterfactual reasoning trajectories and employs iterative preference optimization to steer the model toward these counterfactual paths, achieving effective unlearning while preserving reasoning capability.
CRISP: Compressing Redundancy in Chain-of-Thought via Intrinsic Saliency Pruning: This paper proposes CRISP, a framework that identifies the attention pattern of the </think> token as a reliable indicator for distinguishing critical from redundant steps in reasoning chains. Building on this insight, CRISP designs a greedy-search compression pipeline with four atomic operators, reducing token usage by 50–60% while preserving accuracy.
Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective: Through Cross-CoT experiments and step-wise analysis, this paper reveals a "decoupling mechanism" underlying CoT reasoning: final accuracy is determined by CoT content (~99% variance contribution), whereas distributional ranking is dominated by the model's intrinsic prior (>80%). This demonstrates that long CoT is a strong decision-maker but a weak distribution calibrator.
Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views: This work identifies a shared logical subspace within LLMs that simultaneously aligns natural-language and symbolic-logic reasoning representations. Steering activations along this subspace at inference time improves logical reasoning accuracy by up to 11 percentage points without any model training.
Dissecting Failure Dynamics in Large Language Model Reasoning: By analyzing LLM reasoning trajectories, this work finds that errors concentrate at a small number of critical turning points in the early stages, after which the model enters a "cognitive spiral"—continuously extending the reasoning in a locally coherent but globally erroneous manner. Based on these findings, the paper proposes the GUARD framework, which performs short-range branching repairs at high-risk turning points detected via entropy signals.
Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error: This paper proposes LTE (Learning to reason from Trial and Error), which mitigates exploration stagnation in RLVR by using the model's own erroneous answers as hints to guide additional rollouts, without relying on any external expert supervision.
DPC: Training-Free Text-to-SQL Candidate Selection via Dual-Paradigm Consistency: DPC reframes Text-to-SQL candidate selection from "guessing over hidden data" to "deterministic verification over constructed data": it builds a Minimal Discriminative Database (MDD) that forces conflicting SQL candidates to produce different execution results, then uses Python/Pandas solutions as reference anchors to select the correct candidate via cross-paradigm consistency, outperforming Self-Consistency by up to 2.2% on BIRD and Spider.
Efficient PRM Training Data Synthesis via Formal Verification: This paper proposes FoVer, a framework that leverages formal verification tools (Z3 and Isabelle) to automatically annotate step-level correctness labels for reasoning chains in formal reasoning tasks. It constructs the FoVer-40K training set and fine-tunes a PRM, demonstrating formal-to-informal transfer capability and cross-task generalization across 12 reasoning benchmarks.
Efficient Process Reward Modeling via Contrastive Mutual Information: This paper proposes CPMI (Contrastive Pointwise Mutual Information), an efficient automatic step-level reward annotation method that estimates step-wise contributions by contrasting the conditional probability shifts a reasoning step induces on correct versus incorrect answers. Compared to Monte Carlo estimation, CPMI reduces construction time by 84% and token generation by 98%, while achieving higher accuracy on both process-level evaluation benchmarks and mathematical reasoning benchmarks.
Efficient Test-Time Scaling via Temporal Reasoning Aggregation: This paper proposes TRACE, a framework that determines reasoning convergence by aggregating two complementary signals within a sliding window — answer consistency across steps and confidence trajectory over time — enabling training-free dynamic early exit that reduces token usage by 25–30% with only a 1–2% accuracy drop.
Explicit Trait Inference for Multi-Agent Coordination: This paper proposes Explicit Trait Inference (ETI), a method that enables LLM agents to reason about and track partners' behavioral traits along the psychological dimensions of warmth and competence. ETI reduces payoff loss by 45–77% in economic games and improves task performance by 3–29% on MultiAgentBench.
Failure Modes in Multi-Hop QA: The Weakest Link Effect and the Recognition Bottleneck: This paper proposes Multi-Focus Attention Instruction (MFAI) as a semantic probe to reveal the "weakest link effect" in multi-hop QA — multi-hop reasoning performance is determined by the absolute position of the least visible evidence bucket rather than the inter-fact distance. Failures primarily stem from a recognition bottleneck rather than reasoning deficits, and System-2 reasoning models can effectively resist positional bias and misleading attention cues.
FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents: This paper proposes FS-Researcher, a file-system-based dual-agent framework for deep research. A Context Builder constructs a hierarchical knowledge base while a Report Writer composes reports section by section. By leveraging a persistent workspace to overcome context window limitations, the framework achieves 53.94 RACE (SOTA) on DeepResearch Bench and demonstrates a positive test-time scaling effect between context-building compute and report quality.
GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO: This paper presents GanitLLM, the first mathematical reasoning model that genuinely reasons in Bengali (rather than translating or reasoning in English), together with Ganit, a difficulty-annotated Bengali math dataset. The proposed Curriculum-GRPO addresses the cold-start problem in GRPO training for low-resource languages. The 4B model achieves an 8 percentage-point accuracy gain on Bn-MGSM, and the proportion of Bengali reasoning tokens increases from 14% to 88%.
Generating Effective CoT Traces for Mitigating Causal Hallucination: This paper proposes the Causal Hallucination Rate (CHR) metric to quantify the tendency of small LLMs to over-predict causal relations in event causal identification (ECI). Systematic experiments identify two key criteria for effective CoT data: sufficiently long semantic explanations and a distribution aligned with the target model. A low-cost CoT data generation pipeline designed around these criteria reduces the CHR of Qwen2.5-1.5B from 83.54% to 6.26% while improving mean accuracy to 66.00%.
How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study: This paper systematically investigates how to enhance the safety of large reasoning models (LRMs) via SFT. It identifies five risky reasoning patterns—most notably weak vacillation—as the root cause of limited effectiveness in direct safety response distillation, proposes targeted distillation strategies that reduce the PAIR attack success rate from 63% to 13%, and demonstrates that short chain-of-thought and template-based reasoning achieve safety performance comparable to full-length reasoning chains.
JTPRO: A Joint Tool-Prompt Reflective Optimization Framework for Language Agents: JTPRO proposes a joint optimization framework that requires no model fine-tuning. Through reflection-driven iterative editing, it simultaneously optimizes global instructions and per-tool schema/parameter descriptions, significantly improving end-to-end success rates for tool selection and slot filling in large-scale tool library settings, achieving 5%–20% OSR gains over baselines such as GEPA.
Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning: This paper proposes InstruCoT, which synthesizes diverse training data covering multiple injection vectors and threat scenarios, and introduces a three-stage instruction-level chain-of-thought fine-tuning framework based on a situation-aware model. This enables LLMs to effectively identify and reject malicious instructions under various prompt injection attacks, substantially outperforming existing defenses across three evaluation dimensions: behavioral deviation, privacy leakage, and harmful output.
Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners: This paper systematically investigates the latent reasoning behavior of large reasoning models (LRMs) across 11 languages, finding that latent reasoning capability exists multilingually but is unevenly distributed (stronger for high-resource languages, weaker for low-resource ones), and that internal reasoning dynamics tend toward an English-centric shared pathway.
Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting: CoT2Edit proposes a new paradigm for teaching LLMs to perform knowledge editing via CoT reasoning. It constructs CoT instruction data for both structured and unstructured edits, trains with SFT warm-start followed by GRPO optimization, and retrieves edited facts via RAG at inference time. A single training run achieves SOTA across 6 editing benchmarks with strong generalization.
Logical Phase Transitions: Understanding Collapse in LLM Logical Reasoning: This paper identifies a "logical phase transition" phenomenon in LLM logical reasoning—performance collapses abruptly at specific complexity thresholds rather than degrading smoothly. The authors propose a Logical Complexity Metric (LoCM) to quantify this phenomenon, and design a Neuro-Symbolic Curriculum Tuning (NSCT) framework that achieves average accuracy gains of +1.26 over naive prompting and +3.95 over CoT across five benchmarks via adaptive neuro-symbolic alignment and complexity-aware curriculum optimization.
MARCH: Evaluating the Intersection of Ambiguity Interpretation and Multi-hop Inference: This paper introduces the MARCH benchmark (2,209 multi-hop ambiguous questions) and the CLARION framework, presenting the first systematic study of QA challenges at the intersection of ambiguity interpretation and multi-step reasoning, and revealing severe deficiencies in existing SOTA models on such problems.
MathAgent: Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis: This paper proposes MathAgent, a hierarchical data synthesis framework based on adversarial evolution of constraint graphs. It reformulates data synthesis from a text generation task into an unsupervised optimization problem over constraint graphs. A three-agent Legislator system (Proposer-Critic-Moderator) evolves problem skeletons, which are then instantiated into natural language by an Executor. With only 1K synthetic samples, MathAgent surpasses LIMO and s1K across eight mathematical benchmarks.
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning: OctoTools is a training-free, user-friendly, and easily extensible multi-agent framework that encapsulates heterogeneous tools via standardized tool cards, adopts a Planner-Executor separation paradigm, and employs a task-specific toolset optimization algorithm. It achieves an average accuracy improvement of +9.3% over GPT-4o and up to +10.6% over frameworks such as AutoGen and LangChain across 16 diverse benchmarks.
Parallel Test-Time Scaling for Latent Reasoning Models: This paper is the first to introduce parallel test-time scaling (parallel TTS) into latent reasoning models. It proposes two uncertainty-theoretic stochastic sampling strategies (MC-Dropout and additive Gaussian noise) along with a step-level contrastively trained latent reward model (LatentRM), enabling models that reason in continuous vector spaces to achieve consistent performance gains through parallel sampling and aggregation.
Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards: This paper proposes leveraging the Planning Domain Definition Language (PDDL) to automatically generate large-scale, high-precision step-level reward datasets for training Process Reward Models (PRMs), achieving significant improvements on both mathematical and non-mathematical reasoning benchmarks.
ReCoQA: A Benchmark for Tool-Augmented and Multi-Step Reasoning in Real Estate Question and Answering: This paper introduces ReCoQA—a large-scale benchmark comprising 29,270 real estate question-answer pairs—that requires models to perform hybrid multi-source reasoning by integrating database queries and map API calls. The authors further propose HIRE-Agent, a hierarchical multi-agent framework serving as a strong baseline, and systematically identify the bottlenecks of existing LLMs in complex reasoning within vertical domains.
Reinforced Efficient Reasoning via Semantically Diverse Exploration: ROSE proposes a semantic-entropy-guided MCTS branching strategy and a length-aware segment-level advantage estimation to address the insufficient exploration diversity and low reasoning efficiency of existing MCTS-based RLVR methods, achieving state-of-the-art pass@8 performance across multiple mathematical reasoning benchmarks.
Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning: This paper proposes Render-of-Thought (RoT), the first approach to render textual CoT reasoning steps as images. It leverages a pretrained visual encoder as a semantic anchor to align LLM hidden states to the visual embedding space, achieving 3–4× token compression and significant inference acceleration while preserving the interpretability of the reasoning chain.
Revisiting Entropy in Reinforcement Learning for Large Reasoning Models: This paper systematically investigates entropy dynamics in RLVR training of LLMs, identifies positive-advantage tokens as the primary driver of entropy collapse, and proposes Positive-Advantage Reweighting, which dynamically adjusts the loss weights of positive-advantage tokens to effectively regulate model entropy.
Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models: This paper proposes GenCluster, a scalable test-time compute framework that achieves gold-medal performance on IOI 2025 (446.75/600) with the open-weight model gpt-oss-120b, via large-scale parallel generation → behavioral clustering → tournament ranking → round-robin submission.
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning: This paper proposes CoT-PoT, a cross-modal ensembling method that exploits the complementarity between chain-of-thought (CoT) and program-of-thought (PoT) reasoning modalities to reduce the number of samples required for self-consistency by 9.3×, resolving 78.6% of problems with only 2 samples.
Self-Reinforcing Controllable Synthesis of Rare Relational Data via Bayesian Calibration: This paper proposes RDDG, a tabular data synthesis framework based on progressive Chain-of-Thought, which guides LLMs to generate high-fidelity tabular data through coreset selection, relational mining, and a self-reinforcing feedback mechanism, achieving an average Macro-F1 improvement of over 2% on imbalanced classification tasks.
Semantic-Aware Logical Reasoning via a Semiotic Framework: This paper proposes LogicAgent, a logical reasoning framework grounded in the Greimas Semiotic Square. By performing multi-perspective semantic analysis and reflective verification, LogicAgent achieves state-of-the-art logical reasoning performance under the dual challenges of semantic and logical complexity.
Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning: This paper proposes Step-GRPO, which internalizes dynamic early-exit capability into the model — measuring reasoning complexity via semantic steps rather than raw tokens, exposing concise correct trajectories through dynamically truncated rollouts, and guiding the model to learn when to stop reasoning via step-aware relative rewards. On Qwen3-8B, it reduces token consumption by 32% with no accuracy degradation.
Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment: This paper identifies that LLM agents exhibit a human-like "actor-observer asymmetry" (AOA) cognitive bias during role-play — when acting as actors, agents tend to attribute failures to external factors, while as observers they tend to attribute failures to internal errors. The authors propose ReTAS, which employs dialectical reasoning (thesis–antithesis–synthesis) and GRPO-based alignment to mitigate this bias.
Think Outside the Policy: In-Context Steered Policy Optimization: This paper proposes ICPO (In-Context Steered Policy Optimization), which leverages the in-context learning (ICL) capability of large language models as implicit expert guidance to expand the policy exploration space during RLVR training, without relying on reasoning trajectories from external, stronger models.
Towards Effective In-context Cross-domain Knowledge Transfer via Domain-invariant-neurons-based Retrieval: This paper proposes DIN-Retrieval, which identifies domain-invariant neurons (DINs) in LLMs exhibiting consistent activation polarity across domains, constructs a domain-robust representational subspace for retrieving structurally compatible cross-domain demonstrations, and provides the first systematic evidence that cross-domain ICL examples can improve LLM reasoning performance, achieving an average gain of 1.8% on math-to-logic reasoning transfer.
TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models: TrigReason proposes an event-triggered collaboration framework between small and large reasoning models. By analyzing three systematic failure modes of small reasoning models (SRMs)—path deviation, cognitive overload, and recovery failure—the framework designs three corresponding triggers: strategic priming, cognitive offloading, and intervention request. These triggers replace step-wise polling verification, enabling 1.70–4.79× more reasoning steps to be offloaded to the SRM while maintaining LRM-level accuracy, reducing latency by 43.9% and API cost by 73.3%.
TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards: This paper frames automated multi-turn jailbreak attacks as a multi-turn reinforcement learning problem and proposes TROJail, which introduces two heuristic process rewards—over-harm penalization and semantic relevance progression—to alleviate the sparse supervision problem of outcome-only rewards, achieving substantial improvements in attack success rate across multiple models and benchmarks.
When Is Thinking Enough? Early Exit via Sufficiency Assessment for Efficient Reasoning: This paper proposes the DTSR framework, which detects "reflection signals" (e.g., Wait, Alternatively) during the reasoning process and triggers a self-assessment of the current reasoning chain's "sufficiency" at those positions to determine whether to exit early. DTSR achieves 28.9%–34.9% reasoning length reduction on the Qwen3 model series with negligible accuracy loss.
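
As a rough illustration of the reflection-signal idea in the DTSR entry above, the sketch below scans a reasoning chain for markers such as "Wait" and "Alternatively" and stops at the first marker whose preceding text is judged sufficient. It is a minimal sketch under stated assumptions, not the paper's implementation: the marker list, the function names, and the `is_sufficient` callback (a stand-in for the model's self-assessment) are all hypothetical.

```python
import re
from typing import Callable, List

# Hypothetical marker list; the summary only names "Wait" and "Alternatively".
REFLECTION_MARKERS = ("Wait", "Alternatively")
_MARKER_RE = re.compile(r"\b(?:" + "|".join(REFLECTION_MARKERS) + r")\b")


def reflection_positions(reasoning: str) -> List[int]:
    """Return the character offsets at which a reflection marker starts."""
    return [m.start() for m in _MARKER_RE.finditer(reasoning)]


def early_exit(reasoning: str, is_sufficient: Callable[[str], bool]) -> str:
    """Truncate the chain at the first marker whose prefix is judged sufficient.

    `is_sufficient` is an assumed callback standing in for a self-assessment
    step; if it never returns True, the full chain is returned unchanged.
    """
    for pos in reflection_positions(reasoning):
        prefix = reasoning[:pos].rstrip()
        if is_sufficient(prefix):
            return prefix
    return reasoning


# Toy usage: the "sufficiency check" here is just a substring test.
chain = "2 + 2 = 4. Wait, let me double-check. 4 is right. Alternatively, count up."
shortened = early_exit(chain, lambda prefix: "= 4" in prefix)
```

In this toy run the chain is cut just before the first "Wait". A real system would presumably replace the substring test with a model-based judgment of the partial reasoning.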

📦 Model Compression

A Computational Method for Measuring "Open Codes" in Qualitative Analysis: This paper proposes a theoretically grounded computational framework that employs an LLM-augmented code merging algorithm alongside four ground-truth-free metrics (Coverage, Overlap, Novelty, and Divergence) to systematically evaluate the performance of both human and AI coders in inductive qualitative coding.
A Layer-wise Analysis of Supervised Fine-Tuning: This paper conducts a systematic layer-wise analysis of SFT across 1B–32B models from three perspectives—information-theoretic, geometric, and optimization-based—revealing that instruction-following capability is concentrated in the middle layers (20%–80%) rather than uniformly distributed. Based on this finding, the paper proposes a Mid-Block Efficient Tuning strategy that selectively updates middle layers, achieving up to 10.2% improvement over standard LoRA on GSM8K.
Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference: This paper proposes ASL (Adaptive Selection Layer), which monitors the variance of token attention score rankings to adaptively determine the layer at which KV cache pruning is performed. ASL significantly outperforms fixed-layer selection methods on difficult tasks while remaining training-free.
Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis: This paper proposes an analytical post-training framework that rapidly restructures dense FFN layers into sparse MoE by analyzing neuron activation patterns — distinguishing high-frequency shared experts from low-frequency routed experts and constructing routers directly from activation statistics — achieving 1.17× speedup with only 2k-sample fine-tuning.
arXiv2Table: Toward Realistic Benchmarking and Evaluation for LLM-Based Literature-Review Table Generation: This paper proposes the arXiv2Table benchmark (1,957 tables, 7,158 papers) and introduces distractor papers, schema-agnostic user requests, and an annotation-free QA-based evaluation framework to enable more realistic assessment of LLM-based literature-review table generation, along with an iterative batch generation method.
Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference: CSD proposes a training-free enhancement framework for speculative decoding that records high-frequency rejection patterns via an Online Correction Memory (OCM) to provide rescue candidates, and then validates candidate reliability through a Semantic Consistency Gating (SCG) mechanism based on probability ratios. The approach achieves up to 2.33× throughput improvement over standard speculative decoding while also improving accuracy on HumanEval and MATH500.
CBRS: Cognitive Blood Request System with Bilingual Dataset and Dual-Layer Filtering: CBRS proposes a multi-platform framework that efficiently detects and parses blood donation requests from social media message streams via a dual-layer filtering architecture (lightweight classifier + LLM). The work introduces the first bilingual dataset of 11K blood donation requests spanning Bengali, English, and transliterated Bengali. A LoRA fine-tuned Llama-3.2-3B achieves 92% zero-shot accuracy on the parsing task.
ChemAmp: Amplified Chemistry Tools via Composable Agents: This paper proposes a novel "tool amplification" paradigm (distinct from conventional tool orchestration) and introduces the ChemAmp framework, which treats chemistry-specific tools (UniMol2, Chemformer, etc.) as composable building blocks to dynamically construct task-specialized super-agents. ChemAmp surpasses both domain-specific models and general-purpose LLMs on four core chemistry tasks—including molecular design and reaction prediction—while reducing inference token costs by 94%.
CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents: This paper proposes CLAG, a clustering-based agent memory framework that organizes memories into semantically coherent clusters via SLM-driven routing, performs local evolution updates within clusters, and filters noise through two-stage retrieval, achieving significant improvements over global memory pool baselines across multiple QA datasets.
Compositional Steering of Large Language Models with Steering Tokens: This paper proposes compositional steering tokens that compress behavioral instructions into input-space embedding vectors via self-distillation, and trains a dedicated composition token to capture the general concept of "composition." The approach demonstrates strong generalization to unseen behavior combinations, unseen behaviors, and unseen numbers of behaviors to compose.
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing: This paper proposes DASH-KV, a framework that reformulates the attention mechanism as an approximate nearest neighbor search problem. By employing asymmetric deep hashing to encode queries and keys into binary codes, high-dimensional floating-point similarity computation is replaced with efficient Hamming distance bit operations. Combined with a dynamic mixed-precision mechanism, the approach reduces long-context inference complexity from \(O(N^2)\) to \(O(N)\) while matching the performance of full attention.
DeepPrune: Parallel Scaling without Inter-Trace Redundancy: This paper proposes DeepPrune, which trains a dedicated judge model to predict answer equivalence from partial reasoning traces and combines it with an online greedy clustering algorithm to dynamically prune redundant parallel CoT paths. DeepPrune reduces token consumption by 65.73%–88.50% while maintaining competitive accuracy within 3 percentage points.
Efficient Learned Data Compression via Dual-Stream Feature Decoupling: This paper proposes FADE, a framework that employs a Dual-stream Multi-scale Decoupler to separate micro-syntactic and macro-semantic features into parallel shallow streams (replacing deep serial stacking), combined with a Hierarchical Gated Refiner and a Concurrent Stream Parallel Pipeline, achieving state-of-the-art performance in both compression ratio and throughput simultaneously.
Enabling Agents to Communicate Entirely in Latent Space: This paper proposes Interlat, a framework enabling LLM agents to communicate entirely in latent space. The sender directly transmits the final-layer hidden states as a continuous representation of its "thoughts"; the receiver interprets these latent messages via a communication adapter and further compresses them to as few as 8 tokens through latent-space reasoning, achieving up to 24× communication speedup while maintaining competitive performance.
Establishing a Scale for Kullback–Leibler Divergence in Language Models Across Various Settings: This paper embeds language models of diverse architectures into a unified space via log-likelihood vectors, systematically measures the characteristic KL divergence scales across multiple settings—pretraining, model scale, random seeds, quantization, fine-tuning, and inter-layer analysis—and reveals that pretraining trajectories exhibit subdiffusive behavior in log-likelihood space: despite continuous drift in weight space, the output distributions stabilize early in training.
FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration: This paper proposes FastKV, which decouples context reduction (Token-Selective Propagation during the prefill phase) from KV cache compression (layer-wise KV retention during the decoding phase), achieving 1.82× prefill speedup and 2.87× decoding speedup on LLaMA-3.1-8B-Instruct while limiting accuracy degradation to within 1% on LongBench.
Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation: This paper proposes PerSyn (Personalized data Synthesis), which adopts a "Route then Generate" paradigm where a router assigns the optimal teacher model to each prompt by jointly considering student learnability and teacher response quality. Compared to the conventional "Generate then Select" paradigm, PerSyn is more efficient and effective, consistently outperforming all baselines across instruction tuning and mathematical reasoning tasks.
MAGEO: From Experience to Skill — Multi-Agent Generative Engine Optimization via Reusable Strategy Learning: This paper reframes Generative Engine Optimization (GEO) from per-instance heuristic optimization to a strategy learning problem, proposing the MAGEO multi-agent framework. The execution layer consists of four collaborating agents — preference, planning, editing, and evaluation — operating in an iterative Generate-Evaluate-Select loop, while the learning layer distills validated edit patterns into reusable engine-specific strategy skills. A Twin Branch causal evaluation protocol and the DSV-CF dual-axis metric are introduced, achieving substantial improvements over heuristic baselines across three mainstream generative engines.
From Weights to Activations: Is Steering the Next Frontier of Adaptation?: This paper systematically argues that steering (inference-time activation-space intervention) should be recognized as an independent model adaptation paradigm. It proposes eight functional evaluation criteria to compare steering against fine-tuning, PEFT, and prompt engineering, positioning steering as a locally reversible, activation-space behavior modification approach with unique advantages in computational efficiency, data efficiency, and reversibility.
HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference: This paper proposes HeteroCache, a training-free dynamic KV cache compression framework that exploits two dimensions of attention head characteristics—temporal heterogeneity (stable vs. drifting heads) and intra-layer redundancy (clustering of similar heads)—to implement fine-grained role assignment. Larger cache budgets are allocated to drifting heads, while representative heads sparsely monitor attention drift to trigger asynchronous on-demand retrieval, achieving 3× decoding speedup under 224K context.
IMPACT: Importance-Aware Activation Space Reconstruction: This paper proposes IMPACT, a framework that shifts LLM low-rank compression from minimizing weight reconstruction error to minimizing importance-weighted activation reconstruction error. By incorporating gradient information into the activation covariance matrix, IMPACT derives a closed-form optimal solution, achieving up to 55.4% model size reduction while preserving accuracy.
CadLLM: Improving the Throughput of Diffusion-based LLMs via Training-Free Confidence-Aware Calibration: This paper proposes CadLLM, a training-free adaptive inference acceleration method that leverages token decoding confidence signals in diffusion large language models (dLLMs) to dynamically adjust four dimensions—block size, number of steps, vocabulary sampling range, and commitment threshold—achieving 1.1–2.28× throughput improvements on LLaDA and DREAM while maintaining competitive accuracy.
JudgeMeNot: Personalizing Large Language Models to Emulate Judicial Reasoning in Hebrew: This paper proposes a synthetic-organic supervision pipeline that transforms raw judicial opinions into reasoning instruction-tuning data. Through a Chain-of-LoRA strategy (CLM → instruction tuning), the framework achieves high-fidelity emulation of individual judges' reasoning styles, producing outputs indistinguishable from authentic judicial writing in the low-resource Hebrew setting.
Latent-Condensed Transformer for Efficient Long Context Modeling: LCA proposes performing context compression directly in the latent space of MLA — aggregating semantic latent vectors via query-aware weighted pooling and preserving positional accuracy through anchor selection for positional keys — achieving 2.5× prefill speedup and 90% KV cache compression on 128K contexts while maintaining competitive performance.
LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization: This work formalizes label-free prompt optimization as a dueling bandit problem and proposes the Prompt Duel Optimizer (PDO), which employs Double Thompson Sampling to efficiently select the most informative prompt pairs for comparison. Combined with a top-performer mutation strategy to expand the search space, PDO identifies stronger prompts on BBH and MS MARCO with fewer judge calls than existing baselines.
LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging: This paper proposes LoGo (LoRA on the Go), a training-free framework that extracts LoRA activation signals (norm or entropy) via a single forward pass to dynamically select and merge the most relevant LoRA adapters at the instance level, enabling cross-task generalization without labeled data or additional training.
MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization: This paper proposes MAESTRO, which reformulates reward scalarization in GRPO as a contextual bandit problem. A lightweight Conductor network leverages the final-layer hidden states of the policy model to adaptively select reward weights for each prompt–response pair, consistently outperforming static-reward and single-reward baselines across seven open-domain benchmarks.
Mem²Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation: This paper proposes Mem²Evolve, a self-evolving agent framework that achieves co-evolutionary capability expansion and experience distillation via a dual-memory mechanism (Asset Memory + Experience Memory). The framework attains an average Pass@1 of 70.24% across 8 benchmarks spanning 6 task categories, outperforming the strongest experience-centric and capability-centric baselines by 11.80% and 6.46%, respectively.
Mem^p: Exploring Agent Procedural Memory: This paper proposes the Mem^p framework, which systematically investigates how to construct learnable, updatable, and lifelong-evolving procedural memory for LLM agents. By distilling past task trajectories into fine-grained step-by-step instructions and high-level script abstractions, coupled with a dynamic update mechanism (addition / validation / reflection / retirement), Mem^p achieves consistent improvements in success rate and substantial reductions in execution steps on TravelPlanner and ALFWorld.
Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models: Through systematic comparison of hypernetwork-based LoRA adaptation versus carefully designed few-shot prompting across four benchmarks, this work demonstrates that a 227.8M-parameter hypernetwork yields zero gain—few-shot examples contribute +21.5%, document encoding contributes +5.0%, and the hypernetwork contributes 0%. A 3B model with well-crafted prompts achieves 79.7% of GPT-5's average performance at 10× lower latency.
Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models: This paper systematically probes 25 Transformer language models (ranging from BERT Base to Qwen2.5-7B) and finds that lexical identity (lexeme) is linearly decodable in early layers but decays with depth, while inflectional features remain stably readable across all layers and occupy compact, controllable subspaces.
No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation: This paper proposes NWCAD, a decoding-time adapter that employs a two-stage gating mechanism to precisely fall back to context-free decoding when the context is uninformative (preventing neutral regression), and to leverage context for correction when it is helpful — simultaneously satisfying the objectives of "do no harm" and "be effective."
Polynomial Expansion Rank Adaptation: Enhancing Low-Rank Fine-Tuning with High-Order Interactions: This paper proposes PERA (Polynomial Expansion Rank Adaptation), which introduces structured polynomial expansions (square and cross terms) into the parameter space of low-rank factors, extending LoRA's linear adaptation space into a polynomial manifold. Without increasing rank or inference overhead, PERA significantly enhances the expressiveness of weight updates and consistently outperforms LoRA, DoRA, and HiRA on commonsense reasoning and NLU tasks.
Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty: This paper proposes E-GRM, a framework that estimates uncertainty from the convergence behavior of parallel decoding, triggers CoT reasoning only when necessary, and employs a discriminative scorer trained with a hybrid loss to evaluate reasoning path quality. E-GRM achieves state-of-the-art performance across multiple reward modeling benchmarks while reducing inference latency by 62%.
Representation-Guided Parameter-Efficient LLM Unlearning: This paper proposes ReGLU, a framework that shifts LLM unlearning from a "parameter importance" paradigm to a "representation space geometry" paradigm. It introduces Representation-guided Initialization for LoRA Adaptation (RILA), which aligns unlearning updates to the most discriminative subspace between the forget and retain sets, and a Representation Orthogonality Loss (ROL) that prevents updates from interfering with retain-set knowledge.
SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning: SAMoRA addresses imprecise routing and inflexible weight fusion in existing MoE-LoRA methods through a semantic-aware router and a task-adaptive scaling mechanism, achieving state-of-the-art performance on multi-task benchmarks with only 0.15% trainable parameters.
SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization: This paper proposes SCURank, a ranking framework based on Summary Content Units (SCUs). It extracts SCUs from candidate summaries, estimates information importance via cross-summary clustering, and scores candidates by informativeness. SCURank replaces unstable LLM-based direct ranking and coarse-grained ROUGE-based ranking. Combined with BRIO contrastive learning in a multi-LLM distillation setting, it significantly improves the summarization performance of distilled models.
SeLaR: Selective Latent Reasoning in Large Language Models: This paper proposes SeLaR, a lightweight training-free framework that activates soft-embedding latent reasoning exclusively at high-entropy "exploration steps" via an entropy gating mechanism, while retaining discrete decoding at high-confidence "certainty steps." An entropy-aware contrastive regularization is introduced to prevent soft embeddings from collapsing toward the dominant token. SeLaR consistently outperforms standard CoT and state-of-the-art training-free methods across five reasoning benchmarks.
Supplement Generation Training for Enhancing Agentic Task Performance: SGT (Supplement Generation Training) trains a small LLM (1.7B) to generate instance-specific supplement text (reasoning cues, summaries, error reminders, etc.) that is appended to the input, enabling a frozen large Actor model to solve tasks more effectively. SGT achieves an average improvement of 21% across 5 benchmarks without modifying the Actor's parameters.
Task-Stratified Knowledge Scaling Laws for Post-Training Quantized LLMs: This paper establishes the first task-stratified knowledge scaling laws for post-training quantization (PTQ), decomposing LLM capabilities into three levels—memorization, application, and reasoning—and jointly modeling four factors: model size, bit-width, group size, and calibration set size. The laws are validated across 293 PTQ configurations, revealing differentiated patterns: reasoning is sensitive to precision, application improves with scale, and memorization is sensitive to calibration data.
Training-Free Test-Time Contrastive Learning for Large Language Models: This paper proposes TF-TTCL, a gradient-free test-time contrastive learning framework that enables a frozen LLM to self-improve online through an Explore–Reflect–Guide cycle. It employs multi-agent role-playing to generate diverse reasoning trajectories, distills textual rules from positive–negative contrastive pairs into a memory bank, and retrieves relevant rules at inference time to guide generation.
UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text: UKP_Psycontrol achieves first place on both subtasks of SemEval-2026 Task 2 by combining LLM prompting, a MaxEnt model with Ising interactions, and a neural regression model. The system reveals that LLMs excel at capturing static affective signals, whereas short-term affective changes are better explained by recent numerical trajectories than by textual semantics.
Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment: This paper proposes the Rank-Surprisal Ratio (RSR), a metric that jointly measures the informativeness and alignment of reasoning trajectories with respect to a student model, achieving an average Spearman correlation of 0.86 with post-training performance across 5 student models and 11 teacher models, and demonstrating utility in both trajectory selection and teacher selection.
WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling: This paper proposes an Equivalent Model Theory and the WISCA weight scaling strategy, which dynamically balances the L1 norms of \(W_q/W_k\) and \(W_v/W_o\) in Transformer attention layers during training—without altering model outputs—to steer optimization toward flatter loss minima. On GQA architectures, WISCA achieves an average 5.6% improvement on zero-shot evaluation and a 2.12% reduction in training perplexity.
YIELD: A Large-Scale Dataset and Evaluation Framework for Information Elicitation Agents: This paper introduces the Information Elicitation Agent (IEA) as a novel conversational paradigm, releases YIELD — the first large-scale human-human information elicitation dialogue dataset (2,281 conversations, 26M tokens) — formalizes the elicitation process as a finite-horizon POMDP, and proposes dedicated evaluation metrics (Conformity, Progression, TLR). Experiments demonstrate that fine-tuning on YIELD significantly improves LLM alignment with authentic elicitation behavior.
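
For the DASH-KV entry in this list, the step of replacing floating-point similarity with Hamming distance over binary codes can be illustrated with a small sketch. It assumes the queries and keys have already been hashed into fixed-width binary codes, stored here as Python integers. The learned asymmetric hashing and the mixed-precision mechanism described in the summary are not modeled, and the function names are hypothetical.

```python
from typing import List


def hamming(a: int, b: int) -> int:
    """Count the differing bits between two binary codes."""
    return bin(a ^ b).count("1")


def top_k_keys(query_code: int, key_codes: List[int], k: int) -> List[int]:
    """Return indices of the k keys closest to the query in Hamming distance.

    Ties are broken by the lower index so the result is deterministic.
    """
    order = sorted(
        range(len(key_codes)),
        key=lambda i: (hamming(query_code, key_codes[i]), i),
    )
    return order[:k]


# Toy usage with 4-bit codes (illustrative values, not real hash outputs).
query = 0b1011
keys = [0b1011, 0b0000, 0b1010, 0b0100]
nearest = top_k_keys(query, keys, k=2)
```

Here the distances to the four keys are 0, 3, 1, and 4, so the two nearest keys are at indices 0 and 2. A production system would use packed bit arrays and hardware popcount rather than Python integers.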

🔍 Information Retrieval & RAG

All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG: This paper systematically reveals severe language bias (favoring English and the query language) in the reranking stage of multilingual RAG systems, and proposes the LAURA framework, which aligns the reranker via supervision signals driven by downstream generation quality, effectively mitigating bias and improving generation performance.
An Iterative Utility Judgment Framework Inspired by Philosophical Relevance via LLMs: Inspired by Schutz's philosophical theory of relevance, this paper proposes ITEM, an iterative utility judgment framework that enables the three core RAG components—relevance ranking, utility judgment, and answer generation—to mutually and dynamically enhance one another, yielding improvements over baselines across retrieval, utility judgment, and QA tasks.
Bayesian Active Learning with Gaussian Processes Guided by LLM Relevance Scoring: This paper proposes BAGEL, a Bayesian active learning framework based on Gaussian Processes (GP) that propagates sparse LLM relevance signals across the embedding space via an exploration–exploitation strategy under a limited LLM budget, enabling global passage retrieval that substantially outperforms conventional LLM re-ranking methods.
Beyond Black-Box Interventions: Latent Probing for Faithful Retrieval-Augmented Generation: This paper proposes ProbeRAG, which discovers the linear separability of conflicting and aligned knowledge in LLM latent spaces, and designs a three-stage framework (fine-grained knowledge pruning → latent conflict probing → conflict-aware attention) to address RAG faithfulness from the perspective of the model's internal mechanisms.
Beyond Explicit Refusals: Soft-Failure Attacks on Retrieval-Augmented Generation: This paper formally defines the "soft-failure" threat in RAG systems (generating fluent but uninformative responses) and proposes DEJA, a black-box evolutionary attack framework that injects adversarial documents to exploit safety alignment mechanisms and induce ambiguous responses, achieving a Soft Attack Success Rate (SASR) exceeding 79% with high stealthiness.
CarO: Chain-of-Analogy Reasoning Optimization for Robust Content Moderation: This paper proposes CarO (Chain-of-Analogy Reasoning Optimization), a two-stage training framework that enables LLMs to autonomously generate analogical reference cases during inference for content moderation. The framework combines RAG-guided analogical chain generation, SFT, and customized DPO. On ambiguous moderation benchmarks, CarO achieves an average F1 improvement of 24.9%, substantially outperforming reasoning models (DeepSeek R1) and dedicated moderation models (LLaMA Guard).
ChAIRO: Contextual Hierarchical Analogical Induction and Reasoning Optimization for LLMs: This paper proposes ChAIRO, a Contextual Hierarchical Analogical Induction and Reasoning Optimization framework that employs a three-stage pipeline (analogical case generation → rule induction → rule-injected fine-tuning) to enable LLMs to autonomously generate analogical cases and induce explicit moderation rules for content moderation. ChAIRO achieves a 4.5% F1 improvement over single-instance rule generation and a 2.3% improvement over static RAG.
ChunQiuTR: Time-Keyed Temporal Retrieval in Classical Chinese Annals: This paper proposes ChunQiuTR, the first temporal retrieval benchmark built upon a non-Gregorian calendar system, constructed from the Spring and Autumn Annals and its exegetical traditions. It further introduces CTD (Calendrical Temporal Dual-encoder), which achieves temporally-aware retrieval via Fourier-based absolute calendrical context and relative temporal offset biases, substantially outperforming pure semantic baselines.
CodePromptZip: Code-specific Prompt Compression for Retrieval-Augmented Generation in Coding Tasks with LMs: This paper proposes CodePromptZip, the first code-specific prompt compression framework, which constructs training data via type-aware priority ranking and trains a small-model compressor with a copy mechanism. It achieves improvements of 23.4%, 28.7%, and 8.7% over the best baseline across three coding tasks.
Conjecture and Inquiry: Quantifying Software Performance Requirements via Interactive Retrieval-Augmented Preference Elicitation: This paper proposes IRAP (Interactive Retrieval-Augmented Preference Elicitation), a method that quantifies natural-language software performance requirements into mathematical functions. Evaluated on 4 real-world datasets against 10 state-of-the-art baselines, IRAP achieves up to 40× performance improvement using only 5 interaction rounds.
Context Attribution with Multi-Armed Bandit Optimization: This paper proposes CAMAB, which frames context attribution in RAG — identifying which context segments contribute to the generated answer — as a Combinatorial Multi-Armed Bandit (CMAB) problem. Using Linear Thompson Sampling to adaptively explore the space of context subsets, CAMAB reduces model queries by up to 30% compared to SHAP and ContextCite while matching or surpassing attribution quality on HotpotQA, CNN/DM, and TyDi QA.
CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering: This paper proposes CounterRefine, a lightweight inference-time repair layer: a standard RAG pipeline first generates a preliminary answer, which then conditions a counterevidence retrieval step to collect supporting and contradicting evidence; a constrained KEEP/REVISE decision gate combined with deterministic validation corrects erroneous answers, improving GPT-5's accuracy on SimpleQA from 67.3% to 73.1%.
CRAFT: Training-Free Cascaded Retrieval for Tabular QA: This paper proposes CRAFT, a three-stage cascaded table retrieval framework requiring no dataset-specific training (SPLADE sparse filtering → semantic mini-table ranking → neural re-ranking). By augmenting table representations with Gemini-generated captions and descriptions, CRAFT achieves SOTA on NQ-Tables (R@1 49.84), demonstrates strong zero-shot generalization on OTT-QA, and exhibits remarkable robustness to query paraphrasing.
CURaTE: Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge: CURaTE proposes a behavioral unlearning framework based on sentence embedding matching: a general-purpose unlearning embedder is trained prior to deployment (without any forget set); after deployment, embeddings of incoming unlearning requests are stored in a database; at inference time, cosine similarity determines whether to answer or refuse a query. LLM weights are never modified, yielding near-perfect knowledge preservation.
Detecting RAG Extraction Attack via Dual-Path Runtime Integrity Game: This paper proposes CanaryRAG, a RAG runtime defense mechanism inspired by stack canaries in software security. By injecting non-semantic canary tokens into retrieved chunks and designing a dual-path integrity game — the target path should not leak canary tokens, while the Oracle path should be able to elicit them — CanaryRAG detects knowledge base extraction attacks in real time, achieving state-of-the-art protection without compromising task performance or inference latency.
Domain-Specific Data Generation Framework for RAG Adaptation: This paper proposes RAGen, a scalable and modular data generation framework that automatically synthesizes domain-specific QAC (Question-Answer-Context) data through document-level concept extraction, multi-chunk evidence assembly, and Bloom's Taxonomy-guided question generation. The framework supports contrastive fine-tuning of embedding models and supervised fine-tuning of LLMs, achieving substantial improvements over AutoRAG and LlamaIndex baselines across three domain-specific datasets.
DQA: Diagnostic Question Answering for IT Support: This paper proposes the DQA framework, which achieves systematic fault diagnosis in enterprise IT support by maintaining persistent diagnostic states and aggregating retrieved evidence at the root-cause level rather than processing documents individually. The success rate improves from a baseline of 41.3% to 78.7%, while the average number of turns decreases from 8.4 to 3.9.
End-to-End Optimization of LLM-Driven Multi-Agent Search Systems via Heterogeneous-Group-Based Reinforcement Learning: This paper proposes MHGPO (Multi-Agent Heterogeneous Group Policy Optimization), a critic-free multi-agent RL method that achieves end-to-end optimization in a three-agent search system (Rewriter→Reranker→Answerer) through heterogeneous-group relative advantage estimation and backward reward propagation. The method captures implicit cross-agent dependencies and cross-trajectory correlations, significantly outperforming MAPPO and GRPO baselines on multi-hop QA benchmarks such as HotpotQA.
Enhancing LLM-based Search Agents via Contribution Weighted Group Relative Policy Optimization: CW-GRPO reframes process supervision as "advantage redistribution": an LLM judge evaluates the retrieval utility and reasoning correctness of each search turn and computes a contribution score that rescales outcome-based advantages, yielding turn-level credit assignment without introducing an unstable value function. The approach outperforms standard GRPO by 5.0% on Qwen3-8B.
Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion: This paper demonstrates that the apparent "English preference" in multilingual RAG systems is primarily an artifact of structural priors embedded in evaluation benchmarks (i.e., gold evidence concentrated in English and cultural priors) rather than an intrinsic model bias. The authors propose a debiased language preference metric, DeLP, which reveals that retrievers actually prefer monolingual alignment. Building on this insight, they design the DELTA query augmentation framework, which consistently outperforms English-pivot strategies on multilingual RAG benchmarks.
FAITH: Factuality Alignment through Integrating Trustworthiness and Honestness: This paper proposes FAITH, a framework that maps LLM uncertainty signals (consistency + semantic entropy) to natural-language descriptions of knowledge state quadrants (trustworthiness × honestness), designs uncertainty-aware fine-grained reward functions for PPO training, and applies a RAG module to correct potentially erroneous outputs, systematically improving the factual accuracy of LLMs.
Feedback Adaptation for Retrieval-Augmented Generation: This paper proposes feedback adaptation as a new problem setting for RAG systems—investigating how quickly and effectively corrective feedback propagates to future queries. It defines two evaluation axes, correction latency and post-feedback performance, and introduces PatchRAG as a training-free, inference-time feedback integration approach that achieves immediate correction and strong generalization.
FLARE: Task-Agnostic Embedding Model Evaluation via Normalizing Flows: This paper proposes FLARE, a label-free text embedding model evaluation framework based on normalizing flows. By estimating informational sufficiency directly from log-likelihoods, FLARE avoids the collapse of distance-based density estimation in high-dimensional spaces, achieving a Spearman \(\rho\) of 0.90 against supervised baselines across 11 datasets.
From Relevance to Authority: Authority-aware Generative Retrieval in Web Search Engines: This paper proposes AuthGR, the first framework to systematically integrate document authority into generative retrieval. It combines VLM-based multimodal authority scoring, a three-stage progressive training pipeline (CPT→SFT→GRPO), and a hybrid ensemble deployment pipeline. The approach is validated through large-scale A/B testing on Naver's commercial search engine, demonstrating significant improvements in user engagement.
How Retrieved Context Shapes Internal Representations in RAG: This paper systematically analyzes how retrieved documents influence the internal states of LLMs in RAG from the perspective of hidden representations, identifying five key patterns: random documents induce large representation drift and trigger refusal behavior; relevant documents primarily confirm rather than alter parametric knowledge; a single relevant document can anchor representations in multi-document settings; later layers progressively emphasize parametric knowledge, thereby limiting the influence of retrieved evidence; and LLMs can distinguish random documents in early layers but fail to reliably separate distractor documents from relevant ones even at the final layer.
Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy: HEAVEN proposes a plug-and-play two-stage hybrid-vector framework that accelerates coarse retrieval via Visual Summary Pages (VS-Pages) with a single-vector model and reduces multi-vector reranking computation via POS-based query token filtering. Across four benchmarks, the framework retains 99.87% of the multi-vector Recall@1 while reducing per-query FLOPs by 99.82%.
Is Agentic RAG Worth It? An Experimental Comparison of RAG Approaches: This paper systematically compares Enhanced RAG and Agentic RAG across four dimensions—user intent handling, query rewriting, document refinement, and underlying LLM selection—on four datasets. The results show that each paradigm has distinct advantages: Agentic RAG is more flexible in intent routing and query rewriting, while Enhanced RAG is more effective in document reranking. Notably, Agentic RAG incurs up to 3.3× higher cost.
KoCo-Bench: Can Large Language Models Leverage Domain Knowledge in Software Development?: KoCo-Bench introduces the first code benchmark with an explicit domain knowledge corpus, covering 11 frameworks and 25 projects across 6 emerging domains (RL, Agent, RAG, etc.). It evaluates LLMs' ability to acquire and apply domain knowledge for code generation and knowledge comprehension, revealing that even the strongest coding agent, Claude Code, achieves only 34.2%.
MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits: This paper proposes MAB-DQA, a framework that decomposes complex queries into multiple aspect sub-queries, dynamically evaluates the importance of each aspect via a multi-armed bandit mechanism (Thompson Sampling), and redistributes retrieval budgets accordingly, achieving significant improvements in retrieval precision and answer accuracy for multimodal document question answering.
MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation: This paper proposes MASS-RAG, a training-free multi-agent synthesis RAG framework that employs three specialized filtering agents—Summarizer, Extractor, and Reasoner—to process retrieved documents from complementary perspectives, followed by a Synthesis Agent that integrates multi-perspective evidence or candidate answers, consistently outperforming strong baselines across four benchmarks.
Multi-Faceted Self-Consistent Preference Alignment for Query Rewriting in Conversational Search: This paper proposes MSPA-CQR, which constructs self-consistent preference data across three dimensions—rewriting, retrieval, and response—and trains a query rewriting model via prefix-guided multi-dimensional DPO, achieving significant improvements over existing methods in both in-distribution and out-of-distribution settings.
RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding: RACER proposes a training-free speculative decoding method that unifies retrieval-based exact pattern matching with logits-based future prediction. It constructs a Logits Tree via a copy-logit strategy and a Retrieval Tree via an LRU-eviction Aho-Corasick automaton, achieving over 2× inference speedup across multiple benchmarks.
ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval: ReasonEmbed introduces three technical innovations—ReMixer, a non-trivial synthetic data pipeline (82K high-quality samples); Redapter, an adaptive reasoning-intensity-weighted training strategy; and multi-backbone implementation—achieving an nDCG@10 of 38.1 on the BRIGHT benchmark, surpassing all existing text embedding models by approximately 10 points.
Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking: This paper proposes Region-R1, which formulates query-side region cropping in multimodal re-ranking as a decision-making problem. Via reinforcement learning (r-GRPO), the model learns when and how to crop question-relevant regions from the query image, achieving improvements of 20% and 8% in CondRecall@1 on E-VQA and InfoSeek, respectively.
RepoShapley: Shapley-Enhanced Context Filtering for Repository-Level Code Completion: This paper proposes RepoShapley, a coalition-aware context filtering framework based on Shapley values, which estimates the interactive contribution of retrieved code snippets in combination to determine whether each snippet should be retained or discarded, thereby significantly improving repository-level code completion quality.
Prune-then-Merge: Towards Efficient Multi-Vector Visual Document Retrieval: This paper proposes Prune-then-Merge, a two-stage training-free multi-vector document compression framework. It first removes low-information patches via adaptive attention-based pruning, then applies hierarchical clustering to merge the remaining high-signal patches. Evaluated across 29 VDR datasets, the framework extends the near-lossless compression range from 50–60% to 60–70% and significantly outperforms single-stage methods at high compression ratios of 80%+.
SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding: This paper proposes SlideAgent, a hierarchical agentic framework that constructs structured knowledge representations via three dedicated agents operating at the global, page, and element levels, achieving significant improvements in fine-grained understanding of multi-page visual documents, particularly presentation slides.
Stable-RAG: Mitigating Retrieval-Permutation-Induced Hallucinations in Retrieval-Augmented Generation: This paper identifies a critical yet previously overlooked vulnerability in RAG systems—high sensitivity to the ordering of retrieved documents—and proposes Stable-RAG, which applies spectral clustering over hidden states induced by document permutations to identify dominant reasoning patterns, then uses DPO alignment to redirect hallucinated outputs toward correct answers, achieving simultaneous improvements in accuracy and reasoning consistency across three QA datasets.
TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice: This paper introduces TaxPraBen, the first LLM evaluation benchmark targeting Chinese real-world tax practice. It comprises 14 datasets with 7.3K samples spanning three authentic scenarios—tax risk prevention, tax inspection analysis, and tax planning. The paper proposes a scalable evaluation paradigm based on structured parsing, field-aligned extraction, and hybrid numerical–textual matching. Evaluation of 19 LLMs reveals that closed-source models and Chinese-optimized models outperform others, while YaYi2, a tax-domain fine-tuned model, yields only marginal improvements.
To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs: This paper introduces GlobalLies—a multilingual parallel dataset comprising 440 misinformation generation templates and 6,867 entities across 8 languages and 195 countries—and reveals that LLMs exhibit systematic country-level and language-level biases in misinformation propagation: compliance rates are significantly higher for low-HDI countries (statistical correlation \(\rho=-0.355\), \(p=5\times10^{-7}\)), low-resource languages elicit compliance rates more than 30% higher than English, and existing safety classifiers and RAG-based safeguards provide uneven protection globally.
TPA: Next Token Probability Attribution for Detecting Hallucinations in RAG: This paper proposes TPA, a framework that mathematically decomposes the generation probability of each token in an LLM into contributions from seven sources (Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, and Initial Embedding), and combines part-of-speech (POS) tagging for feature aggregation to achieve state-of-the-art hallucination detection in RAG settings.
Understanding Structured Financial Data with LLMs: A Case Study on Fraud Detection: This paper proposes FinFRE-RAG, a two-stage framework that serializes high-dimensional tabular transaction data into natural language via importance-guided feature reduction, and combines label-aware retrieval-augmented in-context learning to substantially improve F1/MCC of open-source LLMs on financial fraud detection, narrowing the performance gap with specialized tabular classifiers.
VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG: VideoStir proposes a structured and intent-aware RAG framework for long video understanding. By modeling videos as spatio-temporal graphs for multi-hop clip retrieval and training an intent relevance scorer for frame-level filtering, the framework achieves performance comparable to state-of-the-art long video RAG methods without relying on any auxiliary text tools.
Why These Documents? Explainable Generative Retrieval with Hierarchical Category Paths: This paper proposes HyPE, a generative retrieval framework that first generates hierarchical category paths (e.g., "Government >> Government by cities") before decoding document identifiers, providing query-relevant explanations for retrieval results while simultaneously improving retrieval accuracy.
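The CURaTE entry above describes an answer-or-refuse decision made by cosine similarity against stored unlearning-request embeddings. A minimal sketch of such a gate follows, assuming toy 3-dimensional vectors in place of the trained embedder and a hypothetical threshold of 0.8; the class and method names are illustrative and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


class UnlearningGate:
    """Keeps embeddings of unlearning requests; the LLM weights are not involved."""

    def __init__(self, threshold=0.8):  # hypothetical threshold, not from the paper
        self.threshold = threshold
        self.forget_embeddings = []

    def add_request(self, embedding):
        # Called after deployment whenever a new unlearning request arrives.
        self.forget_embeddings.append(list(embedding))

    def should_refuse(self, query_embedding):
        # Refuse if the query is close to any stored unlearning request.
        return any(
            cosine(query_embedding, stored) >= self.threshold
            for stored in self.forget_embeddings
        )


gate = UnlearningGate()
gate.add_request([1.0, 0.0, 0.0])            # toy embedding of a forget request
refuse_similar = gate.should_refuse([0.9, 0.1, 0.0])
refuse_unrelated = gate.should_refuse([0.0, 1.0, 0.0])
```

Because the gate sits in front of the model, adding a request is a list append rather than a training step, which is the property the entry attributes to leaving the LLM weights unmodified.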
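The CanaryRAG entry above describes injecting non-semantic canary tokens into retrieved chunks and flagging responses that leak them. The sketch below covers only that target-path leak check, under stated assumptions: the `CANARY-` token format, the bracketed placement, and all function names are hypothetical, and the Oracle path of the dual-path game is not modeled.

```python
import secrets


def make_canary(n_bytes=8):
    """Random marker with no semantic content; the format is an assumption."""
    return "CANARY-" + secrets.token_hex(n_bytes)


def inject_canaries(chunks):
    """Append a fresh canary to each retrieved chunk.

    Returns the tagged chunks and the list of canaries to watch for.
    """
    canaries = [make_canary() for _ in chunks]
    tagged = [f"{chunk} [{canary}]" for chunk, canary in zip(chunks, canaries)]
    return tagged, canaries


def leaked_canaries(response, canaries):
    """Canaries that appear verbatim in the model response."""
    return [c for c in canaries if c in response]


chunks = ["Refund policy: 30 days.", "Shipping takes 5 business days."]
tagged, canaries = inject_canaries(chunks)

# A normal answer paraphrases the content and carries no canary.
benign_response = "Refunds are accepted within 30 days."
# An extraction attack that dumps the context verbatim carries one along.
extraction_response = "Here is my context verbatim: " + tagged[0]
```

A non-empty result from `leaked_canaries` would be the signal to block or log the response in this simplified setup.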
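The MAB-DQA entry above describes estimating the importance of each query aspect with Thompson Sampling and redistributing the retrieval budget accordingly. A rough sketch of that idea with Beta-Bernoulli posteriors is shown below; the aspect names, the binary "useful" feedback signal, and the proportional allocation rule are assumptions made for illustration, not details from the paper.

```python
import random


class AspectBandit:
    """Beta-Bernoulli Thompson Sampling over query aspects."""

    def __init__(self, aspects, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per aspect, stored as [alpha, beta].
        self.params = {a: [1.0, 1.0] for a in aspects}

    def update(self, aspect, useful):
        """Record whether evidence retrieved for this aspect was useful."""
        if useful:
            self.params[aspect][0] += 1.0
        else:
            self.params[aspect][1] += 1.0

    def posterior_mean(self, aspect):
        alpha, beta = self.params[aspect]
        return alpha / (alpha + beta)

    def sample_importance(self):
        """Draw one importance estimate per aspect from its posterior."""
        return {
            a: self.rng.betavariate(alpha, beta)
            for a, (alpha, beta) in self.params.items()
        }

    def allocate(self, budget):
        """Split an integer retrieval budget in proportion to the sampled draws."""
        draws = self.sample_importance()
        total = sum(draws.values())
        shares = {a: int(budget * d / total) for a, d in draws.items()}
        # Give the rounding remainder to the aspects with the highest draws.
        leftover = max(0, budget - sum(shares.values()))
        for a in sorted(draws, key=draws.get, reverse=True)[:leftover]:
            shares[a] += 1
        return shares


bandit = AspectBandit(["date", "author", "layout"])
for _ in range(20):
    bandit.update("date", useful=True)      # toy feedback: "date" keeps helping
    bandit.update("layout", useful=False)   # toy feedback: "layout" never helps
shares = bandit.allocate(budget=10)
```

Sampling from the posterior, rather than using the posterior mean directly, keeps some budget flowing to aspects that have little feedback so far, such as `author` here.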
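The RepoShapley entry above describes scoring retrieved code snippets by their contribution in combination, using Shapley values, before deciding which to keep. The sketch below computes exact Shapley values for a three-snippet toy case. The utility function is invented for illustration (two snippets that only help together, plus one distractor), and the keep-if-positive rule is an assumption; the entry does not say how the paper estimates or thresholds these values.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley value of each player under a coalition utility function.

    Enumerates every coalition, so it is only practical for a handful of players.
    """
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # Marginal gain from adding p to this coalition.
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


def toy_utility(coalition):
    """Hypothetical completion quality for a set of retrieved snippets."""
    score = 0.0
    if {"a", "b"} <= coalition:   # "a" and "b" are only useful together
        score += 1.0
    if "c" in coalition:          # "c" is a distractor that hurts
        score -= 0.2
    return score


values = shapley_values(["a", "b", "c"], toy_utility)
kept = [snippet for snippet, v in values.items() if v > 0]
```

Scoring `a` or `b` in isolation would give each a utility of zero, so an independent filter would discard both; the coalition-aware value credits each with half of their joint gain.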

🦾 LLM Agent

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts: This paper proposes AgencyBench — a comprehensive benchmark comprising 138 real-world tasks that evaluates six core agent capabilities. Each scenario requires an average of 90 tool calls and 1M tokens. Fully automated evaluation is achieved via a user simulation agent and Docker sandbox.
Agent-GWO: Collaborative Agents for Dynamic Prompt Optimization in Large Language Models: This paper proposes Agent-GWO, which integrates the leader-follower hierarchy of the Grey Wolf Optimizer (GWO) into a multi-agent framework to jointly optimize prompt templates and decoding hyperparameters (temperature, top-p, etc.), consistently outperforming existing prompt optimization methods across 11 mathematical and mixed reasoning benchmarks.
ATLAS: Adaptive Trading with LLM AgentS Through Dynamic Prompt Optimization and Multi-Agent Coordination: This paper proposes ATLAS, a multi-agent financial trading framework, and Adaptive-OPRO, a prompt optimization method. ATLAS employs specialized analyst agents to prepare heterogeneous market information and dynamically optimizes the instruction prompt of a central trading agent based on delayed and noisy feedback, achieving significant improvements over baselines across diverse market volatility conditions.
Bayesian Social Deduction with Graph-Informed Language Models: This paper proposes GRAIL (Graph Reasoning Agent Informed through Language), a hybrid reasoning framework that externalizes probabilistic inference to a factor graph model while delegating language understanding and interaction to an LLM. GRAIL is the first agent to defeat human players in the social deduction game Avalon (67% win rate) while consuming far fewer computational resources than large-scale reasoning models.
CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents: CI-Work is an enterprise-scenario benchmark grounded in Contextual Integrity (CI) theory. It reveals that state-of-the-art LLM agents systematically violate privacy norms in enterprise workflows, and that scaling model size exacerbates rather than mitigates leakage.
CodeStruct: Code Agents over Structured Action Spaces: This paper proposes CodeStruct, a framework that redefines code repositories as AST-based structured action spaces, enabling LLM code agents to read and edit code via named program entities rather than raw text fragments. CodeStruct achieves 1.2–5.0% accuracy improvements on SWE-Bench Verified while reducing token consumption by 12–38%.
CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution: CoEvolve proposes an agent-data co-evolution framework that extracts three types of weakness signals—forgetting, boundary, and rare—from training trajectories, guiding LLMs to perform targeted environment re-exploration and task synthesis. This allows the training data distribution to dynamically adapt to the agent's evolving capabilities, yielding absolute improvements of 19–23% on AppWorld and BFCL.
Conjunctive Prompt Attacks in Multi-Agent LLM Systems: This paper investigates conjunctive prompt attacks in multi-agent LLM systems: a trigger key embedded in a user query and a hidden template injected into a compromised remote agent each appear benign in isolation, yet activate harmful behavior when routing brings them together at the same agent. Existing defenses (PromptGuard, Llama-Guard, etc.) fail to reliably prevent such attacks.
Creating ConLangs to Probe the Metalinguistic Grammatical Knowledge of LLMs: This paper introduces IASC (Interactive Agentic System for ConLangs), a modular constructed-language building system that probes LLMs' metalinguistic knowledge by requiring them to perform morphosyntactic transformations according to linguistic specifications. The findings reveal that LLMs handle common typological patterns far better than rare ones, and that capability gaps across different LLMs are substantial.
Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4: DAP introduces the concept of Hard Mode ATP (where AI must independently discover answers before constructing proofs, as opposed to Easy Mode statements with embedded answers), releases the MiniF2F-Hard and FIMO-Hard benchmarks, and proposes a two-stage "Discover and Prove" framework — using LLM natural language reasoning to discover answers, then rewriting the statement into an Easy Mode declaration for a formal prover. The framework improves solved problems on CombiBench from 7 to 10 and, for the first time, proves 36 theorems on PutnamBench in Hard Mode.
Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation: Through evaluating over 10,000 research proposals, this paper systematically reveals the phenomenon of "diversity collapse" in multi-agent LLM systems across three levels — model intelligence, agent cognition, and system dynamics. Stronger models, authority-driven role assignments, and dense communication topologies all suppress semantic diversity, with the root cause residing in interaction structure rather than insufficient model capability.
EA-Agent: A Structured Multi-Step Reasoning Agent for Entity Alignment: This paper proposes EA-Agent, which decomposes entity alignment (EA) into a structured multi-step reasoning process. Through planning and execution over a tool pool (triple selector + alignment tool + reflector), EA-Agent achieves interpretable alignment decisions. Combined with reward-guided offline policy optimization that continually improves its planning capability, it achieves up to 3.17% Hits@1 improvement on DBP15K while mitigating the inefficiency caused by redundant triples.
ExpSeek: Self-Triggered Experience Seeking for Web Agents: ExpSeek proposes a step-level entropy self-triggered framework for proactive experience seeking, enabling web agents to determine when and what guidance is needed based on intrinsic model signals during interaction, achieving absolute improvements of 9.3% and 7.5% on Qwen3-8B and Qwen3-32B, respectively.
FairQE: Multi-Agent Framework for Mitigating Gender Bias in Translation Quality Estimation: This paper proposes FairQE, a multi-agent framework that mitigates systematic gender bias in QE models through gender cue detection, gender-flipped variant generation, and dynamic bias-aware score aggregation, without sacrificing translation quality estimation accuracy.
FedGUI: Benchmarking Federated GUI Agents across Heterogeneous Platforms, Devices, and Operating Systems: FedGUI is the first comprehensive federated learning benchmark for cross-platform GUI agents, comprising six datasets covering mobile, web, and desktop environments. It systematically investigates the effects of four types of heterogeneity—cross-platform, cross-device, cross-OS, and cross-source—on federated GUI agent training.
FregeLogic at SemEval 2026 Task 11: A Hybrid Neuro-Symbolic Architecture for Content-Robust Syllogistic Validity Prediction: FregeLogic is a hybrid neuro-symbolic system that combines a five-member LLM ensemble with a Z3 SMT solver as a tiebreaker, achieving a 16% reduction in belief bias alongside a 0.9% accuracy improvement on syllogistic validity prediction.
From Query to Counsel: Structured Reasoning with a Multi-Agent Framework and Dataset for Legal Consultation: This paper introduces JurisCQAD—a large-scale dataset of 43,000+ real Chinese legal consultations—and proposes the JurisMA multi-agent framework, which performs structured task decomposition via a legal element graph and dynamic multi-agent collaboration (Manager Agent + Format Check + Law Search), achieving significant improvements over both general-purpose and law-specialized LLMs on LawBench.
HAG: Hierarchical Demographic Tree-based Agent Generation for Topic-Adaptive Simulation: This paper proposes HAG, a framework that formalizes population-level agent generation as a two-stage hierarchical decision process — first constructing a topic-adaptive demographic distribution tree via a world knowledge model to achieve macro-level distributional alignment, then combining real-data retrieval with LLM-based agent augmentation to ensure micro-level individual consistency. HAG reduces population alignment error by an average of 37.7% and improves sociological consistency by 18.8% across multi-domain benchmarks.
HeLa-Mem: Hebbian Learning and Associative Memory for LLM Agents: HeLa-Mem proposes a neuroscience-inspired memory architecture for LLM agents that models conversation history as a dynamic graph driven by Hebbian learning dynamics — strengthening inter-memory connections through co-activation, distilling hub memories into semantic knowledge via reflective consolidation, and retrieving via a dual-pathway combining semantic similarity with Hebbian spreading activation. It achieves state-of-the-art performance on LoCoMo with significantly fewer tokens.
Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents: This paper proposes STEP-HRL, which introduces a local progress module to iteratively compress interaction history within each subtask into compact textual summaries, enabling both high-level and low-level policies to make decisions based solely on single-step transitions rather than full histories. The approach achieves significant performance and generalization gains on ScienceWorld and ALFWorld while reducing token usage.
How Adversarial Environments Mislead Agentic AI: This paper formalizes the Adversarial Environment Injection (AEI) threat model, decomposing it into breadth attacks (poisoning retrieval results to induce cognitive drift) and depth attacks (injecting phantom nodes to construct navigational traps causing policy collapse). Across 11,000+ experimental runs, the two attack dimensions are found to be completely independent in terms of robustness — a phenomenon termed "robustness splitting" — demonstrating that current single-point defense strategies are fundamentally insufficient.
ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models: This paper introduces ImplicitMemBench, the first benchmark for systematically evaluating implicit memory in LLMs. It comprises 300 test items across three cognitive paradigms—procedural memory, priming effects, and classical conditioning—and reveals severe limitations across 17 models: the best-performing model achieves only 66% overall accuracy, far below the human baseline.
Lightweight LLM Agent Memory with Small Language Models: This paper proposes LightMem, a lightweight LLM agent memory system driven by multiple specialized small language models (SLMs). By modularizing memory operations into a Controller (SLM-1), a Selector (SLM-2), and a Writer (SLM-3), and decoupling online processing from offline consolidation, LightMem achieves an average F1 improvement of approximately 2.5 over A-MEM on the LoCoMo benchmark, while attaining a retrieval latency of 83ms and an end-to-end latency of 581ms.
LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization: This paper proposes Location Preference Optimization (LPO), which combines entropy-based window rewards and distance-based dynamic location rewards within the GRPO framework to improve the spatial grounding accuracy of GUI agents, achieving state-of-the-art performance on both offline and online benchmarks.
MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering: This paper proposes MATA, a multi-agent framework for table question answering that employs a scheduler to prioritize reasoning paths (CoT/PoT/text2SQL), a confidence checker to filter candidate answers, and a judge agent for arbitration. The framework achieves model-agnostic, efficient, and accurate TableQA, with an average EM improvement of 40.1% across 10 LLMs.
MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools: MCP-Flow proposes a Web Agent-based automated pipeline that collects tool information from 1,166 real-world MCP servers and synthesizes 68,733 high-quality training samples, enabling small fine-tuned models (0.6B–8B) to surpass SOTA large models such as GPT-4o on MCP tool use.
MemoPhishAgent: Memory-Augmented Multi-Modal LLM Agent for Phishing URL Detection: This paper proposes MemoPhishAgent (MPA), the first memory-augmented multimodal LLM agent specifically designed for phishing URL detection. MPA dynamically orchestrates five dedicated tools and leverages an episodic memory system to reuse historical reasoning trajectories. It achieves a 13.6% recall improvement on public benchmarks and a 20% improvement on real-world social media data, and has been deployed in production, processing approximately 60K high-risk URLs per week.
Mina: A Multilingual LLM-Powered Legal Assistant Agent for Bangladesh: This work presents Mina, a multilingual LLM-powered legal assistant for the Bangladeshi legal domain. Through a two-stage RAG pipeline that accurately retrieves relevant acts and specific provisions, combined with a tool chain and multilingual embeddings, Mina achieves 75–80% passing rates on the Bangladesh Bar Council exam while reducing legal consultation costs to just 0.12–0.61% of traditional methods.
RISK: A Framework for GUI Agents in E-commerce Risk Management: This paper proposes the RISK framework, comprising a domain dataset (RISK-Data: 8,492 single-step + 2,386 multi-step trajectories), a benchmark (RISK-Bench), and a GRPO-based reinforcement fine-tuning method (RISK-R1) for GUI agents in e-commerce risk management. The 7B model surpasses state-of-the-art baselines with only 7.2% of their parameter count, achieving an online task success rate of 70.5%.
Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration: This paper proposes ExtAgents, a multi-agent framework that addresses the performance degradation observed in existing multi-agent methods when scaling external knowledge input beyond the context window. It introduces two mechanisms—global knowledge synchronization (information exchange across all Seeking Agents) and knowledge-accumulative reasoning (progressively injecting filtered knowledge into the Reasoning Agent)—achieving significant improvements on multi-hop QA and long survey generation tasks.
SecureVibeBench: Evaluating Secure Coding Capabilities of Code Agents with Realistic Vulnerability Scenarios: This paper presents SecureVibeBench, the first repository-level, multi-file-editing secure coding benchmark. It constructs 105 C/C++ secure coding tasks from 41 OSS-Fuzz projects, precisely reconstructing the scenarios in which vulnerabilities were first introduced via cascaded static and dynamic analysis. Evaluation results reveal that the best-performing agent (SWE-agent + Claude Sonnet 4.5) produces code that is simultaneously functionally correct and secure in only 23.8% of cases.
SILO-BENCH: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems: This paper introduces SILO-BENCH, a role-agnostic benchmark for evaluating distributed coordination in multi-agent LLM systems. It comprises 30 algorithmic tasks across three communication complexity levels, with 54 configurations yielding 1,620 experiments. The benchmark reveals a critical communication-reasoning gap: agents spontaneously form reasonable communication topologies and actively exchange information, yet systematically fail to integrate distributed state into correct answers.
Spec-o3: A Tool-Augmented Vision-Language Agent for Rare Celestial Object Candidate Identification: This paper proposes Spec-o3, a tool-augmented vision-language agent that simulates the spectral inspection workflow of professional astronomers via Interleaved Multimodal Chain-of-Thought (iMCoT). Through a two-stage training pipeline of cold-start SFT followed by outcome-based RL, Spec-o3 improves macro-F1 from 28.3% to 76.5% on rare celestial object identification, achieving ~50× speedup over manual inspection.
StructMem: Structured Memory for Long-Horizon Behavior in LLMs: StructMem proposes a structure-enhanced hierarchical memory framework that achieves state-of-the-art performance on the LoCoMo long-conversation benchmark (76.82%) through dual-perspective event-level extraction and cross-event semantic consolidation, while substantially reducing token consumption (1.94M vs. 35.8M for graph memory) and API call counts.
SynthAgent: Adapting Web Agents with Synthetic Supervision: This paper proposes SynthAgent, a web agent adaptation framework built entirely on synthetic supervision. It employs categorized exploration to systematically cover functional regions of webpages for diverse task synthesis, followed by a dual refinement strategy—task refinement (conflict-triggered correction of hallucinations) and trajectory refinement (global-context denoising)—to improve synthetic data quality. SynthAgent significantly outperforms existing synthetic methods on WebArena and Online-Mind2Web.
ToolOmni: Enabling Open-World Tool Use via Agentic Learning with Proactive Retrieval and Grounded Execution: This paper proposes ToolOmni, a unified agentic framework that integrates proactive tool retrieval and retrieval-grounded tool execution within a single reasoning loop. Through cold-start SFT followed by decoupled multi-objective GRPO, the framework jointly optimizes retrieval and execution capabilities, achieving an end-to-end execution success rate on ToolBench that exceeds strong baselines by 10.8%.
Towards Scalable Lightweight GUI Agents via Multi-role Orchestration: This paper proposes LAMO, a framework that trains a lightweight 3B MLLM into a flexibly orchestrated multi-role GUI Agent through role-oriented data synthesis and two-stage training (SFT with Perplexity-Weighted Cross-Entropy + multi-task RL). The agent operates in three modes—monolithic inference, multi-agent collaboration, and plug-and-play policy executor—and achieves a 77.6% success rate on AndroidWorld when paired with a GPT-5 planner, surpassing dedicated GUI agents with 72B parameters.
Uncertainty Quantification in LLM Agents: Foundations, Emerging Challenges, and Opportunities: This paper proposes the first formal framework for Agent Uncertainty Quantification (Agent UQ), modeling an agent's problem-solving trajectory as a stochastic process over a dynamic Bayesian network: \(P(\mathcal{F}_{\leq T}) = P(E_0, O_0) \prod_{i=1}^{T} P_{\pi,\mathcal{T}}(A_i|E_{i-1}, O_{i-1}) P(O_i|A_i, E_i)\). The framework unifies existing UQ paradigms (single-step QA, multi-step reasoning) as special cases and identifies four technical challenges unique to Agent UQ through empirical analysis on \(\tau^2\)-bench.
Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories: This paper proposes SPECTRA, a supervision-free framework that enables small vision-language models (SVLMs) to discover effective tool-calling and visual reasoning behaviors through pure environment interaction, combining cold-start reinforcement learning (GRPO) with soft topological constraints on structured multi-turn rollouts. SPECTRA achieves up to 5% improvement in task accuracy and 9% improvement in tool efficiency across 4 multimodal benchmarks, and introduces the Tool Instrumental Utility (TIU) metric to quantify tool effectiveness without supervision.
What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search: Through large-scale experiments (15 LLMs × 8 tasks, 72K candidate solutions), this paper finds that effective LLM optimizers behave as "local refiners"—consistently producing frequent incremental improvements while progressively concentrating search in semantic space—rather than generating high-novelty, leap-style breakthroughs. A key finding is that novelty per se does not predict optimization performance; novelty is only beneficial when the search remains sufficiently localized.
When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors: This paper proposes two complementary metrics, RPS and AGS, to quantify distillation-induced behavioral homogenization in LLM agents' tool-use behaviors. By distinguishing necessary from unnecessary behaviors, the framework reveals cross-family behavioral inheritance patterns across 18 models, finding that Kimi-K2 exhibits greater behavioral similarity to Claude Sonnet 4.5 than Anthropic's own models do.
Why Agents Compromise Safety Under Pressure: This paper introduces the concept of Agentic Pressure — when LLM agents operating under resource constraints cannot simultaneously complete tasks and comply with safety rules, they spontaneously exhibit norm drift, proactively sacrificing safety to preserve helpfulness. Notably, models with stronger reasoning capabilities are more adept at constructing verbalized rationalizations to justify such violations.
Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception: This paper reveals a "Temporal Blindness" phenomenon in LLM Agents during multi-turn interactions — the inability to adjust tool-calling decisions based on the real elapsed time between messages — and constructs the TicToc benchmark to evaluate this problem.
ZARA: Training-Free Motion Time-Series Reasoning via Evidence-Grounded LLM Agents: This paper proposes ZARA, a knowledge- and retrieval-augmented multi-agent framework that distills sensor signals into a structured textual knowledge base, combines class-conditional retrieval with hierarchical LLM reasoning, and achieves interpretable human activity recognition in a fully training-free setting, substantially outperforming existing methods across 8 datasets.
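
The trajectory factorization quoted in the "Uncertainty Quantification in LLM Agents" note above can be made concrete with a toy calculation. The sketch below multiplies the initial-state probability by one action term and one observation term per step, working in log space for numerical stability. The function name `trajectory_log_prob` and all probability values are hypothetical illustrations, not taken from the paper, and the code only evaluates the product; it does not estimate the individual terms.

```python
import math


def trajectory_log_prob(p_init, action_probs, obs_probs):
    """Log-probability of a trajectory under the factorization
    P(F_{<=T}) = P(E_0, O_0) * prod_i P(A_i | E_{i-1}, O_{i-1}) * P(O_i | A_i, E_i).

    p_init       -- P(E_0, O_0), probability of the initial state/observation
    action_probs -- list of P(A_i | E_{i-1}, O_{i-1}) for i = 1..T
    obs_probs    -- list of P(O_i | A_i, E_i) for i = 1..T
    """
    if len(action_probs) != len(obs_probs):
        raise ValueError("need one action term and one observation term per step")
    log_p = math.log(p_init)
    for p_action, p_obs in zip(action_probs, obs_probs):
        log_p += math.log(p_action) + math.log(p_obs)
    return log_p


# Made-up numbers for a two-step trajectory (T = 2).
p_init = 0.9
action_probs = [0.8, 0.5]
obs_probs = [0.9, 1.0]

log_p = trajectory_log_prob(p_init, action_probs, obs_probs)
prob = math.exp(log_p)   # product of all five factors
surprisal = -log_p       # one simple trajectory-level uncertainty score
print(f"P(trajectory) = {prob:.3f}, surprisal = {surprisal:.3f} nats")
```

With a single step and no environment interaction the product collapses to one action term, which is consistent with the note's remark that single-step QA is a special case of the framework.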

🏥 Medical Imaging

"Excuse Me, May I Say Something…" CoLabScience: A Proactive AI Assistant for Biomedical Discovery: CoLabScience introduces the PULI (Positive-Unlabeled Learning Intervention) framework to train an LLM assistant capable of proactively determining when and how to intervene in biomedical team discussions. By leveraging GRPO and a reinforcement learning coordinator, the system automatically identifies optimal intervention moments and generates scientific suggestions from streaming conversations.
Anonpsy: A Graph-Based Framework for Structure-Preserving De-identification of Psychiatric Narratives: This paper proposes Anonpsy, a framework that reformulates the de-identification of psychiatric narratives as a graph-guided semantic rewriting problem. The approach first converts narratives into semantic graphs, applies constrained perturbations on the graph to modify identity-related information while preserving clinical structure, and finally reconstructs the narrative via graph-conditioned generation.
AROMA: Augmented Reasoning Over a Multimodal Architecture for Virtual Cell Genetic Perturbation Modeling: This paper proposes the AROMA framework, which integrates textual evidence, knowledge graph topological information, and protein sequence features within a multimodal architecture, combined with a two-stage training strategy (SFT + GRPO), to achieve interpretable and accurate prediction of genetic perturbation effects.
Benchmarking and Enabling Efficient Chinese Medical Retrieval via Asymmetric Encoders: This paper proposes CMedTEB (Chinese Medical Text Embedding Benchmark) and CARE (asymmetric retrieval framework). CMedTEB constructs a high-quality Chinese medical retrieval/reranking/STS benchmark via multi-LLM voting with expert validation, while CARE adopts an asymmetric architecture that encodes queries with a lightweight BERT and documents with a large LLM. Through a two-stage progressive alignment strategy, CARE achieves LLM-level retrieval accuracy at BERT-level online latency.
Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering: This paper proposes StsPatient, which extracts domain-specific steering vectors from contrastive instruction/response pairs and applies a Stochastic Token Modulation (STM) mechanism to control injection probability, enabling simulation of standardized patients across different cognitive impairment domains and severity levels. Compared to prompt engineering methods, StsPatient achieves an average improvement of 11.23% in clinical authenticity and surpasses the best baseline by 18.54% in severity controllability.
Beyond the Individual: Virtualizing Multi-Disciplinary Reasoning for Clinical Intake via Collaborative Agents: This paper proposes Aegle, a graph-structured multi-agent framework that virtualizes multidisciplinary team (MDT) consultation for clinical intake. By introducing decoupled parallel reasoning and dynamic topology into the outpatient interview workflow, Aegle surpasses state-of-the-art models across 53 metrics spanning 24 clinical departments.
BioHiCL: Hierarchical Multi-Label Contrastive Learning for Biomedical Retrieval with MeSH Labels: BioHiCL leverages the hierarchical multi-label annotations of MeSH (Medical Subject Headings) to provide structured supervision for dense retrievers. By aligning the embedding space with the MeSH semantic space via depth-weighted label similarity, a 0.1B model surpasses most specialized models on biomedical retrieval, sentence similarity, and question answering tasks.
Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA: This paper investigates how social identity markers (sexual orientation and religious affiliation) distort LLM accuracy and confidence calibration in medical question answering. It finds that the "homosexual" marker consistently degrades performance and induces calibration crises across 9 LLMs, and that intersectional identities produce non-additive, identity-specific harms.
Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?: This paper constructs a high-quality German medical corpus, FineMed-de (7.3 million documents / 5.1 billion tokens filtered from FineWeb2), applies continual pre-training and SLERP model merging to three LLMs (7B–24B), and creates the DeFineMed model family. The results demonstrate that a domain-specialized 7B model can substantially narrow the performance gap with a 24B general-purpose model on German medical tasks, improving the win rate by approximately 3.5×.
Cognitive Policy-Driven LLM for Diagnosis and Intervention of Cognitive Distortions in Emotional Support Conversation: This paper proposes CoPoLLM, a framework that constructs CogBiasESC — the first emotional support conversation dataset annotated with cognitive distortions — and integrates a Cognitive Policy Reinforcement Learning (CPRL) engine with Dual-Stream Conditional Optimization (DSCO) to enable LLMs to diagnose eight types of cognitive distortions and generate strategy-aware intervention responses, achieving state-of-the-art performance over 15 baselines.
CURA: Clinical Uncertainty Risk Alignment for Language Model-Based Risk Prediction: CURA proposes a dual-level uncertainty calibration framework: at the individual level, it aligns predictive uncertainty with error probability; at the cohort level, it regularizes predictions using neighborhood event rates in the embedding space. The framework consistently improves calibration metrics across five clinical risk prediction tasks on MIMIC-IV without sacrificing discriminative performance.
DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training: DART identifies and addresses "harm drift"—a phenomenon whereby fine-tuning LLMs to improve difference-aware classification accuracy (e.g., recognizing legitimate demographic distinctions) causes the model's generated explanations to become increasingly harmful. Through a three-stage Distill-Audit-Repair pipeline, DART improves Llama-3-8B accuracy from 39.0% to 68.8% while reducing harm drift cases by 72.6%.
Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps: This work proposes four audio-attention-based metrics (AudioRatio, AudioConsistency, AudioEntropy, TextEntropy) and trains lightweight logistic regression classifiers to detect hallucinations in Speech Large Language Models (SpeechLLMs) at inference time, improving PR-AUC by up to 0.23 on in-domain data.
Dr. Assistant: Enhancing Clinical Diagnostic Inquiry via Structured Diagnostic Reasoning Data and Reinforcement Learning: This paper proposes the Clinical Diagnostic Reasoning Data (CDRD) structure to capture abstract clinical reasoning logic from symptoms to differential diagnosis. Based on CDRD, a two-stage SFT+RL training pipeline is employed to build the Dr. Assistant model (14B), which surpasses HuatuoGPT-o1-72B by 13.59% in ICD-Recall on clinical inquiry benchmarks, reaching a level competitive with GPT-5.
Efficient and Effective Internal Memory Retrieval for LLM-Based Healthcare Prediction: This paper proposes the K2K framework, which treats the FFN parameter space of LLMs as a retrievable knowledge base. Clinical knowledge is injected via LoRA, activation-guided probes enable precise retrieval, and cross-attention reranking adaptively integrates multi-source internal knowledge — achieving state-of-the-art healthcare prediction without external retrieval latency.
Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach: This paper proposes MedSSR, a framework that enhances LLM medical reasoning through controllable data synthesis with rare disease knowledge injection and a semi-supervised training paradigm of "self-supervised RL → supervised RL." MedSSR improves performance on rare disease tasks by up to 5.93%, exceeding the 3% ceiling observed in all prior methods.
Faithfulness vs. Safety: Evaluating LLM Behavior Under Counterfactual Medical Evidence: This paper introduces the MedCounterFact dataset—constructed by systematically replacing interventions in clinical trials with nonsense words, medical terminology, non-medical objects, and toxic substances—and finds that state-of-the-art LLMs almost unconditionally defer to context when presented with counterfactual medical evidence, confidently providing answers even when the "evidence" attributes therapeutic efficacy to heroin or mustard gas. The findings expose a critical lack of a well-defined boundary between faithfulness and safety.
From Answers to Arguments: Toward Trustworthy Clinical Diagnostic Reasoning with Toulmin-Guided Curriculum Goal-Conditioned Learning: This paper adapts the Toulmin argument model to clinical diagnosis and proposes CGCL, a three-stage curriculum training framework (fact collection → hypothesis testing → synthesis), paired with T-Eval for quantifying reasoning structural completeness. The approach achieves diagnostic reasoning quality comparable to RL-based methods without requiring reinforcement learning.
HCFD: A Benchmark for Audio Deepfake Detection in Healthcare: This paper introduces HCFD, a codec-based audio deepfake detection task for healthcare settings. It constructs HCFK, the first codec-forged speech dataset covering multiple clinical pathological conditions (depression, Alzheimer's disease, dysarthria), and proposes the PHOENIX-Mamba framework, which models heterogeneous forgery evidence prototypes in hyperbolic space, achieving 97.04% accuracy on English depression detection.
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering: This paper proposes HypEHR, a 22M-parameter Lorentz hyperbolic model that embeds medical codes, patient visits, and questions into hyperbolic space. Through hierarchy-aware regularization aligned with the ICD ontology structure, HypEHR achieves performance comparable to LLM-based approaches on the MIMIC-IV EHR question answering task.
Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation: This paper proposes DyReMe, a dynamic medical diagnostic evaluation framework. Its DyGen module generates novel diagnostic cases containing clinically grounded distractors—including differential diagnoses and misdiagnosis factors—while the EvalMed module assesses LLMs across four dimensions: accuracy, veracity, helpfulness, and consistency. The results reveal that existing static benchmarks systematically overestimate LLM diagnostic capability; GPT-5 suffers an 8.25% accuracy drop on DyReMe, and all 12 evaluated LLMs exhibit significant trustworthiness deficiencies.
Language Reconstruction with Brain Predictive Coding from fMRI Data: This paper proposes PredFT, an end-to-end fMRI-to-Text decoding model comprising a main network (language decoding) and a side network (brain predictive coding representations). By extracting prospective semantic representations from prediction-related brain regions (PTO areas) and integrating them into the decoding process, PredFT achieves a BLEU-1 of 34.95% on the LeBel dataset (Sub-1), outperforming the strongest baseline MapGuide by 7.84 percentage points.
Learning Dynamic Representations and Policies from Multimodal Clinical Time-Series with Informative Missingness: This work proposes OPL-MT-MNAR, a framework that learns dynamic patient representations from ICU data by leveraging the information embedded in missingness patterns of structured observations and clinical text. It combines an MNAR-aware multimodal encoder, Bayesian filtering for latent belief states, and offline policy learning to derive sepsis treatment policies that outperform clinician behavior (FQE 0.679 vs. 0.528).
LogosKG: Hardware-Optimized Scalable and Interpretable Knowledge Graph Retrieval: This paper proposes LogosKG, a hardware-aligned knowledge graph retrieval framework that reformulates graph traversal as multiplications over three sparse associative matrices (SUB/OBJ/REL). Combined with degree-aware graph partitioning, cross-graph routing, and on-demand LRU caching, LogosKG enables scalable and interpretable high-hop retrieval over billion-edge KGs on a single device. Downstream KG-LLM interaction experiments further reveal the systematic influence of graph topology on LLM diagnostic reasoning.
MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation: This paper proposes MARCH, a multi-agent framework that simulates the resident–fellow–attending hierarchical collaboration process in radiology. Through three stages—initial report drafting, retrieval-augmented revision, and consensus-driven finalization—MARCH generates CT reports achieving a CE-F1 of 0.399 on the RadGenome-ChestCT dataset, representing a 57.7% improvement over the best baseline Reg2RG (0.253).
Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation: This paper proposes the CARE framework and the FAITH-M benchmark dataset. By integrating conversational context encoding, contrastive exemplar retrieval, and knowledge distillation chain-of-thought reasoning (KD-CoT), CARE performs fine-grained ordinal evaluation of AI-generated psychotherapeutic responses across six therapeutic principle dimensions, achieving a weighted F1 of 63.34—a 64.26% improvement over the strongest baseline, Qwen3.
MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models: This paper proposes R-MHSafe, a role-aware mental health safety taxonomy, and MHSafeEval, a closed-loop agent evaluation framework. Through adversarial multi-turn counseling interactions, the framework systematically uncovers role-dependent cumulative safety failures of LLMs in mental health counseling scenarios, revealing interaction-level harms that existing static benchmarks fail to capture.
Model-Agnostic Meta Learning for Class Imbalance Adaptation: This paper proposes HAMR (Hardness-Aware Meta-Resample), a unified meta-learning framework that dynamically estimates instance-level importance weights via bi-level optimization to prioritize genuinely difficult samples, coupled with a neighborhood-aware resampling mechanism that shifts training focus toward hard samples and their semantic neighbors. HAMR consistently outperforms strong baselines across 6 imbalanced NLP datasets.
Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection: This paper proposes decomposing utterances into Emotion–Logic–Behavior (ELB) components and leveraging LLM reasoning to generate multiple cognitive distortion instances, which are subsequently aggregated via a multi-view gated attention MIL framework for bag-level classification. The approach outperforms direct LLM inference baselines on both the Korean (KoACD) and English (Therapist QA) datasets.
OmniCompliance-100K: A Multi-Domain Rule-Grounded Real-World Safety Compliance Dataset: This paper introduces OmniCompliance-100K, the first large-scale, multi-domain, regulation-grounded safety compliance dataset built upon real-world cases. It comprises 12,985 manually curated regulatory rules and 106,009 real-world compliance cases collected via a Web search agent, spanning 9 domains including AI safety, data privacy, finance, and healthcare. Extensive benchmarking reveals systematic deficiencies in current LLMs' safety compliance capabilities.
PrinciplismQA: A Philosophy-Grounded Approach to Assessing LLM-Human Clinical Medical Ethics Alignment: This paper constructs the PrinciplismQA benchmark (3,648 questions, including knowledge-based MCQA and open-ended clinical ethics dilemmas) grounded in the internationally recognized gold standard of medical ethics—Principlism (the four principles of Autonomy, Non-maleficence, Beneficence, and Justice)—and develops an expert-calibrated evaluation pipeline. The study finds that high accuracy on knowledge benchmarks does not imply clinical ethical reasoning capability: even the strongest model, o3, achieves only 77.5% overall.
ProtoCycle: Reflective Tool-Augmented Planning for Text-Guided Protein Design: ProtoCycle proposes a reflective agent framework that positions an LLM as a planner coupled with a lightweight tool environment for text-guided protein sequence design. Through a multi-round feedback-driven decision loop and online reinforcement learning, the framework achieves strong language alignment while maintaining competitive foldability.
Query Pipeline Optimization for Cancer Patient Question Answering Systems: This paper proposes CoMeta, a three-tier controllable metadata-aware RAG framework for Cancer Patient Question Answering (CPQA). It integrates Clinical Hybrid Semantic-symbolic Document Retrieval (CHSDR), which fuses real-time Boolean search via E-Utilities with MedCPT semantic retrieval, and employs Semantically Enhanced Overlapping Segmentation (SEOS) to prevent context fragmentation. On the CMMQA dataset, CoMeta improves Claude-3-Haiku answer accuracy by 5.24% over CoT and approximately 3% over naive RAG.
RA-RRG: Multimodal Retrieval-Augmented Radiology Report Generation with Key Phrase Extraction: This paper proposes RA-RRG, a framework that leverages an LLM to extract clinically relevant key phrases from radiology reports and construct a retrieval database. Given a chest X-ray image, relevant phrases are retrieved and fed to an LLM for report generation—without any LLM fine-tuning—effectively suppressing hallucinations. The approach requires only 18 GPU hours of training and achieves state-of-the-art performance on CheXbert metrics.
RADS: Reinforcement Learning-Based Sample Selection Improves Transfer Learning in Low-resource and Imbalanced Clinical Settings: This paper proposes RADS (Reinforcement Adaptive Domain Sampling), a reinforcement learning-based sample selection strategy that significantly improves cross-domain disease detection under extreme low-resource and class-imbalanced clinical settings by intelligently selecting a small number of target-domain samples for annotation and joint fine-tuning.
Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework: This paper presents VietPET-RoI, the first 3D PET/CT dataset with fine-grained ROI annotations (in Vietnamese), along with HiRRA, a hierarchical report generation framework that emulates the diagnostic workflow of radiologists. By modeling spatial-morphological inter-ROI relationships via GATv2 graph neural networks, HiRRA achieves a 19.7% improvement in BLEU-4 and a 45.8% improvement in the clinical metric RoIQ.
RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models: This paper proposes RePrompT, a temporally aware LLM framework that consistently outperforms both EHR and LLM baselines on readmission and mortality prediction tasks on MIMIC-III/IV through two complementary mechanisms: recurrent prompt tuning (propagating the hidden state of the previous visit as a soft prompt for the current visit) and struct-encoded prompt tuning (injecting embeddings from a population-level EHR encoder).
RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs in Medicine: RiTeK constructs two large-scale medical textual knowledge graphs (TKGs) and corresponding complex reasoning QA datasets, covering 6 topological structures with rich textual descriptions. It evaluates 11 retrieval methods and reveals critical deficiencies in existing LLM-driven retrieval systems for medical TKG reasoning.
Semi-Supervised Diseased Detection from Speech Dialogues with Multi-Level Data Modeling: This paper proposes an audio-only semi-supervised learning framework that jointly models pathological speech features in clinical dialogues at three levels—session, clip, and frame—using an EMA teacher-student network to dynamically generate high-quality pseudo-labels. With only 11 annotated samples, the framework achieves 90% of fully supervised performance on depression and Alzheimer's disease detection.
Stable On-Policy Distillation through Adaptive Target Reformulation: This paper proposes Veto, a target-level reformulation method that stabilizes on-policy knowledge distillation by constructing a geometric bridging distribution between teacher and student in logit space. A single parameter \(\beta\) simultaneously serves as an adaptive gradient veto in forward KL (suppressing harmful gradients from low-confidence tokens) and a decisiveness knob in reverse KL (balancing reward-driven optimization and output diversity), achieving a 9.2% improvement over SFT on GSM8K.
Text-Attributed Knowledge Graph Enrichment with Large Language Models for Medical Concept Representation: This paper proposes CoMed, an LLM-empowered graph learning framework that constructs a global medical knowledge graph by combining EHR statistical evidence with type-constrained LLM inference, enriches it into a text-attributed graph via LLM-generated node descriptions and edge rationales, and jointly trains a LoRA-finetuned LLaMA encoder with a heterogeneous GNN to learn unified medical concept embeddings, achieving significant improvements in diagnosis prediction on MIMIC-III/IV.
Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry: This paper proposes PlantInquiryVQA, a benchmark comprising 24,950 plant images and 138,068 question–answer pairs, along with a Chain-of-Inquiry (CoI) framework that simulates the adaptive diagnostic inquiry strategies of expert botanists. The benchmark is used to evaluate 18 MLLMs on multi-step visual reasoning for plant pathology diagnosis. Results show that structured inquiry significantly improves diagnostic accuracy and reduces hallucinations; nonetheless, even the strongest model achieves a clinical utility score of only 0.188.
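
The "Stable On-Policy Distillation" note above mentions a geometric bridging distribution between teacher and student in logit space, controlled by a single parameter \(\beta\). The note does not give the formula, so the sketch below assumes the simplest form of geometric interpolation: a linear mix of the two logit vectors followed by a softmax, which equals the normalized geometric mean \(q \propto p_T^{\beta} p_S^{1-\beta}\). The function names and the direction of \(\beta\) (1 recovers the teacher, 0 the student) are assumptions for illustration and may not match the paper's definition.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def geometric_bridge(teacher_logits, student_logits, beta):
    """Assumed bridging target: softmax(beta * z_teacher + (1 - beta) * z_student).

    Mixing logits linearly is the same as taking a weighted geometric mean
    of the two softmax distributions and renormalizing.
    """
    mixed = [beta * t + (1.0 - beta) * s
             for t, s in zip(teacher_logits, student_logits)]
    return softmax(mixed)


# Hypothetical logits over a three-token vocabulary.
teacher = [2.0, 0.0, -1.0]
student = [0.0, 1.0, 0.0]

for beta in (0.0, 0.5, 1.0):
    target = geometric_bridge(teacher, student, beta)
    print(beta, [round(p, 3) for p in target])
```

Intermediate values of \(\beta\) give a target that lies between the student's current distribution and the teacher's, which is the general idea behind using a bridge rather than the raw teacher distribution as the distillation target.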

💬 LLM / NLP

A Study of LLMs' Preferences for Libraries and Programming Languages: This paper presents the first systematic study of library and programming language preferences in code generation across 8 LLMs, revealing that LLMs exhibit strong biases toward popular libraries such as NumPy (with 45% of usages deemed unnecessary) and toward Python (chosen in 58% of high-performance tasks), and that natural language recommendations are inconsistent with actual code generation behavior.
Adam's Law: Textual Frequency Law on Large Language Models: This paper proposes the Textual Frequency Law (TFL), which finds that when semantics are equivalent, prompting or fine-tuning LLMs with higher-frequency textual expressions yields better performance. The authors further introduce frequency distillation and curriculum training strategies to exploit this regularity.
AlphaContext: An Evolutionary Tree-based Psychometric Context Generator for Creativity Assessment: This paper proposes AlphaContext, an evolutionary tree-based psychometric context generator comprising four modules—HyperTree outline planning, MCTS sentence-level generation, MAP-Elites diversity optimization, and assessment-guided iterative refinement—to automatically generate high-quality long-form contexts for creativity assessment, achieving an average improvement of 8% over competitive baselines across 7 evaluation dimensions.
An Existence Proof for Neural Language Models That Can Explain Garden-Path Effects via Surprisal: By fine-tuning neural language models on garden-path sentences, this paper demonstrates the existence of a neural LM that can simultaneously explain garden-path effects and naturalistic reading times via surprisal, providing an existence proof for surprisal theory.
Are Emotion and Rhetoric Neurons in LLM? Neuron Recognition and Adaptive Masking for Emotion-Rhetoric Prediction Steering: This paper systematically investigates the representational mechanisms of emotion and rhetoric neurons in LLMs and their intrinsic relationships. It proposes a multi-dimensional neuron recognition framework and an adaptive masking validation method, enabling targeted steering of emotion/rhetoric predictions and rhetoric-neuron-assisted emotion recognition.
Automatic Combination of Sample Selection Strategies for Few-Shot Learning: This paper proposes ACSESS, a method that automatically identifies complementary sample selection strategies and combines them via weighted aggregation, using three mechanisms: forward selection, backward selection, and Datamodels. Experiments across 23 strategies, 5 ICL models, 3 gradient-based few-shot learning methods, 6 text datasets, and 8 image datasets demonstrate that combined strategies consistently outperform individual strategies and ICL-specific baselines.
ChatHLS: Towards Systematic Design Automation and Optimization for High-Level Synthesis: ChatHLS proposes a multi-agent HLS design framework featuring two core components — HLSTuner (QoR-aware reasoning for pragma selection) and HLSFixer (a hierarchical feedback-enhanced debugging framework) — combined with a self-evolving error case augmentation mechanism (VODA), achieving significant improvements over baselines in HLS-C generation success rate and hardware performance optimization.
CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models: This paper proposes CoSToM, a framework that first applies causal tracing to identify the critical layers encoding Theory-of-Mind (ToM) features within LLMs (finding they concentrate primarily in early layers), then performs lightweight alignment via activation steering at those layers—significantly improving social reasoning quality in negotiation and persuasion dialogues, bridging the gap between "knowing but not applying" and "knowing and applying."
Detoxification for LLM from Dataset Itself: This paper proposes HSPD (Hierarchical Semantic-Preserving Detoxification), a pipeline that leverages SoCD (Soft Contrastive Decoding) to guide an LLM in identifying and rewriting toxic segments in raw corpora while preserving semantics, producing detoxified text that can directly replace original training data for fine-tuning. The approach reduces toxicity probability from 0.42 to 0.18 on GPT2-XL and achieves state-of-the-art detoxification on LLaMA2-7B, OPT-6.7B, and Falcon-7B.
DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot NER: DiZiNER simulates the human pilot annotation workflow: multiple heterogeneous LLMs independently annotate the same text, and inter-model disagreements are analyzed to iteratively refine task instructions. The method achieves zero-shot SOTA on 14 out of 18 NER benchmarks, with an average F1 gain of +8.0, surpassing its supervisor model GPT-5 mini.
Don't Adapt Small Language Models for Tools; Adapt Tool Schemas to the Models: This paper proposes PA-Tool, a training-free tool schema optimization method that leverages a "peakedness" signal borrowed from data contamination detection to identify naming patterns familiar to a model from pretraining. By renaming tool components to align with the internalized knowledge of small language models (SLMs), PA-Tool achieves up to 17% improvement on MetaTool and RoTBench, with an 80% reduction in schema misalignment errors.
EvoSpark: Endogenous Interactive Agent Societies for Unified Long-Horizon Narrative Evolution: EvoSpark proposes a multi-agent framework for long-horizon narrative evolution, addressing social memory stacking and narrative–spatial misalignment through three core designs: hierarchical recursive memory (RSB as social cognitive metabolism), generative scene scheduling (GMS for character–location–plot alignment), and an emergent character grounding protocol (ECGP that converts LLM hallucinations into persistent entities).
Expect the Unexpected? Testing the Surprisal of Salient Entities: This paper investigates the relationship between discourse-level salient entities and surprisal. Using 70K+ manually annotated entity mentions and a novel minimal-pair prompting approach, the study finds that globally salient entities are themselves more surprising (higher surprisal), yet systematically reduce the surprisal of surrounding content. This effect varies by genre and is strongest in topically coherent texts.
FastDiSS: Few-step Match Many-step Diffusion Language Model on Sequence-to-Sequence Generation: This paper analyzes two bottlenecks in continuous diffusion language models under few-step sampling — self-conditioning signal mismatch and training saturation — and proposes the FastDiSS framework, which introduces Self-Conditioning Perturbation (SCP) and Model-Aware Noise Scaling (MANS) to improve robustness, achieving 4×–400× speedup while preserving generation quality across 6 benchmarks.
Foresight Optimization for Strategic Reasoning in Large Language Models: This paper proposes Foresight Policy Optimization (FoPO), which introduces a foresight correction term based on opponent modeling into the policy optimization process, enabling LLMs to explicitly anticipate opponent behavior and adjust their strategies accordingly. FoPO achieves significant improvements in strategic reasoning on both cooperative (Cooperative RSA) and competitive (Competitive Taboo) game tasks, with consistent gains on the cross-domain γ-Bench benchmark.
From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models: This paper presents the first systematic survey of Streaming Large Language Models (Streaming LLMs), proposing a unified definition grounded in data flow and interaction concurrency. It organizes existing approaches into a three-level progressive taxonomy — Output-streaming, Sequential-streaming, and Concurrent-streaming — and covers methodologies and applications across text, speech, and video streaming scenarios.
GRASS: Gradient-based Adaptive Layer-wise Importance Sampling for Memory-Efficient LLM Fine-tuning: GRASS is a framework that employs Mean Gradient Norm (MGN) as a task-aware and training-stage-aware layer importance metric. It adaptively samples and updates a subset of model layers during fine-tuning, coupled with a layer-wise optimizer state offloading mechanism, achieving an improvement of up to 4.38 points in average accuracy while reducing memory usage by up to 19.97%.
HCRE: LLM-based Hierarchical Classification for Cross-Document Relation Extraction: This paper proposes HCRE, a model that reformulates cross-document relation extraction from direct classification over a large relation set into layer-wise hierarchical classification guided by a constructed relation tree. A predict-then-verify inference strategy is designed to mitigate inter-layer error propagation. HCRE achieves substantial improvements over both SLM and LLM baselines on the CodRED benchmark.
How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs: This paper identifies a "benign self-reading" pattern in reasoning LLMs (e.g., DeepSeek-R1) during quantitative reasoning: answer tokens' attention over reasoning traces exhibits forward drift (progressively advancing along the reasoning chain) and semantic anchor concentration (repeatedly revisiting key steps), and this pattern strongly correlates with correctness. Building on this finding, the authors propose a training-free activation steering method driven by Self-Reading Quality (SRQ) scores, achieving accuracy improvements of up to 2.6% across multiple benchmarks.
It's High Time: A Survey of Temporal Question Answering: This paper presents a comprehensive survey of Temporal Question Answering (TQA), proposing a unified analytical framework along three dimensions—corpus temporality, question temporality, and model temporal capability—and systematically reviewing the evolution of TQA methods, benchmark datasets, and evaluation strategies from rule-based pipelines to the Transformer/LLM era, while identifying key challenges for future research.
Iterative Formalization and Planning in Partially Observable Environments: This paper proposes PDDLego+, a framework that enables LLMs to iteratively generate and refine PDDL (Planning Domain Definition Language) representations in partially observable environments. Through a two-phase error refinement loop (solver error + simulation error), the framework achieves effective planning without fine-tuning or in-context demonstrations.
Losses that Cook: Topological Optimal Transport for Structured Recipe Generation: This paper proposes a topological loss function based on Sinkhorn divergence, representing ingredient lists as point clouds in embedding space and minimizing the geometric discrepancy between predicted and reference ingredients. The approach significantly improves ingredient recall and quantity precision in structured recipe generation, with generated outputs preferred by human evaluators in 62% of cases.
Lost in the Prompt Order: Revealing the Limitations of Causal Attention in Language Models: This paper systematically investigates the sensitivity of large language models to the ordering of prompt components in multiple-choice question answering (MCQA). Through controlled experiments, the authors rule out training bias and memory decay hypotheses, identifying the causal attention mask as the fundamental mechanism responsible for the substantial performance degradation observed under the QOC (Question–Options–Context) ordering.
Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness: By comparing the correctness-prediction performance of self-probes (using a model's own hidden states) against external probes (using hidden states from other models), this paper identifies inter-model agreement as the critical confounding factor that masks privileged knowledge. After controlling for agreement, domain-specific privileged knowledge is revealed: it exists in factual tasks but is absent in mathematical reasoning.
Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data: This paper proposes MALMAS, a memory-augmented LLM-based multi-agent system for automated feature generation on tabular data. It employs six specialized agents to explore different dimensions of the feature space in parallel, coordinated by a Router Agent, and leverages a three-tier memory mechanism (procedural/feedback/conceptual) for cross-iteration experience accumulation and strategy refinement. MALMAS outperforms existing baselines on 16 classification and 7 regression datasets.
MulDimIF: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models: This paper proposes MulDimIF, a multi-dimensional constraint framework that systematically evaluates LLM instruction-following capabilities across three dimensions—constraint patterns (3 types), constraint categories (4 classes, 13 subcategories), and constraint difficulty (4 levels)—and significantly improves model performance via GRPO training, finding that gains primarily stem from parameter updates in the attention modules.
Not All Animals Are Equal: Metaphorical Framing through Source Domains and Semantic Frames: This paper proposes ConceptFrameMet, the first computational framework that integrates FrameNet semantic frames with source domains from Conceptual Metaphor Theory (CMT). A RoBERTa-based multi-task model is trained to jointly detect metaphors and predict their semantic frames and source domains. Combined with a log-likelihood ratio (LLR) statistical method for identifying salient metaphorical patterns in discourse, the framework reveals that liberal and conservative outlets employ the same source domains in immigration discourse yet select systematically different semantic frames to convey opposing associations.
One Persona, Many Cues, Different Results: How Sociodemographic Cues Impact LLM Personalization: This paper systematically compares 6 commonly used persona prompting strategies (two variants each of name-based, explicit-mention, and conversation-history cues) across 7 LLMs and 4 tasks. While average responses are highly correlated across prompting strategies, the magnitude of inter-persona differences varies substantially depending on the strategy used. Overly explicit prompts induce stronger personalization bias, cautioning against drawing bias conclusions from any single prompting approach.
Please Refuse to Answer Me: Mitigating Over-Refusal in LLMs via Adaptive Contrastive Decoding: This paper proposes AdaCD (Adaptive Contrastive Decoding), which extracts a refusal token distribution by contrasting token distributions under an extreme safety prompt versus no prompt, then dynamically decides to amplify or suppress refusal behavior based on an agreement ratio. AdaCD reduces over-refusal by 10.35% while simultaneously improving the refusal rate on malicious queries by 0.13%.
Prefix Parsing is Just Parsing: This paper proposes prefix grammar transformation, an efficient method that reduces prefix parsing to ordinary parsing. Given a grammar, the approach constructs a new grammar that generates exactly the set of all prefix strings of the original language, thereby enabling direct reuse of any existing ordinary parsing algorithm without the need for specialized prefix parsing algorithms.
Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms: This paper constructs the RedirectQA dataset—leveraging Wikipedia redirect information to associate the same entity with multiple surface forms—and systematically investigates how non-verbatim memorization in LLMs is affected by entity naming variants. The findings show that factual memorization is neither purely surface-form-specific nor entirely surface-form-agnostic, and that entity-level frequency makes an independent contribution beyond surface-level frequency.
Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffixes: This paper proposes R2A (Route to Rome Attack), which constructs a hybrid ensemble surrogate router in a black-box setting and optimizes universal adversarial suffixes to redirect LLM router decisions from cheap weak models toward expensive strong models — achieving an average attack success rate improvement of 49% across 7 open-source routers and 2 commercial routers (GPT-5-Auto, OpenRouter), with inference costs increasing by 2.7–2.9×.
Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models: This paper identifies a phenomenon termed "style amnesia," in which spoken language models (SLMs) fail to maintain initially specified speaking styles (emotion, accent, volume, speech rate) across multi-turn conversations. Attention analysis reveals the underlying cause as attention dilution, and an explicit recall process is proposed as a mitigation strategy.
The Model Agreed, But Didn't Learn: Diagnosing Surface Compliance in Large Language Models: This paper proposes the SA-MCQ diagnostic framework to reveal the phenomenon of "surface compliance" in knowledge editing — editors achieve high scores on standard benchmarks without genuinely overwriting internal beliefs, models revert to original parametric memory under discriminative self-assessment, and sequential editing accumulates representational residuals that lead to cognitive instability.
Think in Sentences: Explicit Sentence Boundaries Enhance Language Model's Capabilities: This paper proposes inserting delimiter tokens at sentence boundaries in LLM inputs to implement a "think-in-sentences" reasoning paradigm via both ICL and SFT. The approach yields consistent improvements across models ranging from 7B to 600B parameters (GSM8k +7.7%, DROP +12.5%) with negligible additional computational overhead.
Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Collaboration: This paper proposes SpreadsheetAgent, a two-stage multi-agent framework that achieves robust real-world spreadsheet understanding through progressive region-based reading and cross-validation across three formats—code execution, vision, and LaTeX—without exceeding LLM context limits.
Why Did Apple Fall: Evaluating Curiosity in Large Language Models: This paper proposes the first psychologically inspired framework for systematically evaluating curiosity-like behaviors in LLMs. Through a combination of self-report questionnaires and behavioral experiments, it finds that LLMs exhibit curiosity-like behavioral patterns that arise from data fitting and safety constraints rather than intrinsic drives. A curiosity-driven questioning pipeline is further designed to demonstrate that simulating curious behavior can improve downstream reasoning performance.
XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration: This paper presents XtraGPT—the first open-source LLM suite (1.5B–14B) for academic paper revision. By fine-tuning on 7,000 top-venue papers and 140,000 criteria-guided instruction–revision pairs, it enables context-aware, paragraph-level controllable revision. The 7B variant matches GPT-4o-mini, the 14B variant surpasses it, and human evaluation shows an average predicted score improvement of 0.65 points after revision.
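
Two of the notes above (the garden-path existence proof and the salient-entity study) are framed in terms of surprisal, the negative log-probability a language model assigns to a token given its context. The sketch below is only an illustration of that definition, not code from either paper: the function names and the probability values are hypothetical, and in practice the probabilities would come from a language model's next-token distribution.

```python
import math


def surprisal_bits(prob: float) -> float:
    """Surprisal of an event with probability `prob`, in bits: -log2(prob)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def sequence_surprisal(token_probs):
    """Per-token surprisal for a list of conditional token probabilities."""
    return [surprisal_bits(p) for p in token_probs]


# Hypothetical conditional probabilities for three consecutive tokens.
probs = [0.5, 0.25, 0.125]
per_token = sequence_surprisal(probs)
total = sum(per_token)
```

Lower-probability tokens receive higher surprisal, so a token the model finds unexpected in context contributes more bits to the total than a predictable one.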

🎮 Reinforcement Learning

A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions: This paper presents the first systematic survey of reinforcement learning for LLMs under data scarcity, proposing a three-level taxonomy organized around data-centric, training-centric, and framework-centric perspectives, covering data pruning/synthesis/compression, trajectory generation/reward engineering/policy optimization, and self-evolution/co-evolution/multi-agent evolution paradigms.
Adaptive Instruction Composition for Automated LLM Red-Teaming: This paper proposes the Adaptive Instruction Composition (AIC) framework, which employs Neural Thompson Sampling to adaptively select attack instructions from the combinatorial space of crowdsourced harmful queries and jailbreak strategies, jointly optimizing attack success rate (ASR) and diversity. AIC achieves substantial improvements over existing methods on HarmBench.
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation: This paper introduces AJ-Bench, the first benchmark systematically evaluating Agent-as-a-Judge capabilities, covering 155 tasks and 516 annotated trajectories across three domains—search, data systems, and GUI. Experiments demonstrate that Agent-as-a-Judge improves average F1 by approximately 13 percentage points over LLM-as-a-Judge.
AttnPO: Attention-Guided Process Supervision for Efficient Reasoning: This paper proposes AttnPO, a low-overhead process supervision RL framework that leverages intrinsic attention signals for step-level credit assignment. By identifying Key-Focus Heads (KFH) to distinguish redundant from critical reasoning steps, AttnPO substantially reduces reasoning length while significantly improving accuracy.
Bootstrapping Code Translation with Weighted Multilanguage Exploration: BootTrans proposes a bootstrapping multilingual code translation approach that leverages test cases from a single pivot language (Python) as cross-lingual verification oracles, employs a dual-pool architecture to expand training data through experience collection, and designs a language-aware weighting mechanism to dynamically prioritize difficult translation directions, achieving significant improvements over baselines on HumanEval-X and TransCoder-Test.
Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning: This paper proposes DYPO (Dynamic Policy Optimization), which dynamically routes samples to different optimization paths based on difficulty grading — Hard samples use multi-teacher distillation to reduce SFT bias, while Mid samples use Group Alignment Loss to reduce RL variance. DYPO achieves an average improvement of 4.8% on mathematical reasoning benchmarks and 13.3% on OOD tasks.
CAP: Controllable Alignment Prompting for Unlearning in LLMs: This paper proposes the CAP framework, which trains a lightweight SLM to generate controllable prompt prefixes that guide a frozen LLM to selectively forget target knowledge. Without modifying model parameters, CAP achieves reversible and transferable knowledge unlearning in LLMs.
CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning: This paper proposes the CE-GPPO algorithm, which reintroduces gradient signals for low-probability tokens outside the PPO clipping range via stop-gradient operations, enabling fine-grained coordination of policy entropy and achieving a better balance between exploration and exploitation.
ChipSeek: Optimizing Verilog Generation via EDA-Integrated Reinforcement Learning: ChipSeek proposes a hierarchical reward RL framework that integrates the EDA toolchain directly into the training loop. Through Curriculum-guided Dynamic Policy Optimization (CDPO), it enables LLMs to generate RTL code that simultaneously satisfies functional correctness and PPA (Power-Performance-Area) optimization, achieving SOTA on standard benchmarks.
Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions: This paper proposes constructing a compact latent action space for multimodal conversational agents (MCAs) to replace the prohibitively large token action space in RL fine-tuning. A cross-modal projector and a cycle-consistency loss are employed to jointly leverage paired image-text data and text-only data for codebook construction, compressing the action space from 152K (vocabulary size) to 128 (codebook size). The proposed method consistently outperforms token-level RL baselines on two dialogue tasks.
Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training: This paper proposes the Data Mixing Agent, the first model-based end-to-end domain re-weighting framework. By training a small agent on a large collection of data mixing trajectories via CQL-based reinforcement learning, the framework learns generalizable data mixing heuristics that balance source- and target-domain performance during continual pre-training for mathematical reasoning. The learned heuristics generalize to unseen source domains, target models, and domain spaces.
Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with Constraints: This paper proposes Deliberative Searcher, a reasoning-first framework that integrates search operations into chain-of-thought (CoT) generation with explicit confidence calibration. It employs constrained RL with adaptive Lagrangian multipliers to jointly optimize correctness and reliability, reducing the average "false-certain" rate of a 7B model from a baseline of 54% to 2%.
Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning: This paper proposes EasyRL, a cognitively inspired framework that uses only 10% easy labeled data for warmup initialization via knowledge transfer, then progressively masters hard unlabeled data through divide-and-conquer pseudo-labeling and difficulty-progressive self-training, consistently outperforming supervised GRPO trained on the full dataset.
FaithLens: Detecting and Explaining Faithfulness Hallucination: This paper proposes FaithLens, an 8B-parameter faithfulness hallucination detection model trained via high-quality data synthesis with three-dimensional filtering (label correctness, explanation quality, and data diversity) for cold-start SFT, followed by rule-based reinforcement learning (prediction correctness reward + explanation quality reward) for further optimization. FaithLens surpasses GPT-5.2 and o3 across 12 tasks while providing high-quality explanatory outputs.
Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments: This paper proposes FTRL, a framework that constructs stable and controllable tool-use training environments through a five-stage automated pipeline, and designs a verifiable reward mechanism balancing tool-call precision and task completion in an F1-inspired manner. Combined with preference-optimization RL algorithms, FTRL achieves an average performance improvement of over 10% on tool-use benchmarks for 7B–14B models, surpassing even the strongest closed-source models.
Frame of Reference: Addressing the Challenges of Common Ground Representation in Dialogue: This paper introduces the IndiRef benchmark for evaluating dialogue systems' ability to establish and exploit persistent common ground through "relational references" (e.g., "the café next to the park we visited yesterday"). Experiments show that existing LLMs achieve no more than 50% accuracy even under full-context conditions, and that combining synthetic data generation with GRPO training yields performance improvements of 15–20%.
From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models: This paper presents a systematic survey of the functional evolution of uncertainty quantification (UQ) in LLMs—from a "passive diagnostic metric" to an "active control signal"—covering three frontier domains: advanced reasoning (guiding computational allocation and self-correction), autonomous agents (meta-cognitive decision-making driving tool use and information acquisition), and reinforcement learning (mitigating reward hacking and enabling self-improvement via intrinsic rewards).
GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR: This paper proposes GeoRA, a low-rank adaptation method specifically designed for Reinforcement Learning with Verifiable Rewards (RLVR). It constructs a geometry-constrained matrix that fuses spectral and Euclidean priors to extract the principal directions of the RL update subspace for SVD initialization, while freezing a residual matrix as a structural anchor. On Qwen/Llama models ranging from 1.5B to 32B parameters, GeoRA consistently outperforms baselines such as LoRA, PiSSA, and MiLoRA across mathematical, medical, and code RLVR tasks, with stronger out-of-domain generalization and reduced capability forgetting.
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment: This work proposes the HEAL framework, which addresses severe entropy collapse in few-shot RLVR by mixing general-domain data with an Entropy Dynamics Alignment (EDA) reward mechanism. Using only 32 target-domain samples, HEAL matches or surpasses the performance of full-shot RLVR trained on 1K samples.
ImpRIF: Stronger Implicit Reasoning Leads to Better Complex Instruction Following: ImpRIF formalizes the implicit reasoning structure in complex instructions as a verifiable Explicit Reasoning Graph (ERG), constructs large-scale single-turn/multi-turn training data accordingly, and trains models via SFT combined with process-verified RL. Models ranging from 4B to 32B parameters significantly outperform their base counterparts across five instruction-following benchmarks, with the 32B model surpassing several large commercial models.
Language-Coupled Reinforcement Learning for Multilingual Retrieval-Augmented Generation: This paper proposes the LcRL framework, which addresses knowledge bias and knowledge conflict in multilingual RAG through language-coupled GRPO policy optimization and anti-alignment penalty rewards, achieving significant improvements on multilingual question answering tasks.
LENS: Less Noise, More Voice — Reinforcement Learning for Reasoning via Instruction Purification: LENS identifies that many exploration failures in RLVR stem not from problem difficulty but from a small fraction (<5%) of distractor tokens in the prompt. By detecting and removing these tokens to improve rollout success rates, and transferring the learning signal from purified rollouts back to policy optimization on the original noisy prompts, LENS achieves an average improvement of 3.88% and a 1.6× training speedup.
Optimizing User Profiles via Contextual Bandits for Retrieval-Augmented LLM Personalization: This paper proposes PURPLE, a framework that models user profile construction in retrieval-augmented LLM personalization as a contextual bandit problem. It employs the Plackett-Luce ranking model to capture inter-record dependencies, uses the LLM's log-likelihood over reference responses as a reward signal, and directly optimizes retrieval to align with generation quality.
Quality Over Clicks: Intrinsic Quality-Driven Iterative RL for Cold-Start E-Commerce Query Suggestion: This paper proposes Cold-EQS, a query suggestion framework for cold-start e-commerce scenarios. It leverages answerability, factual accuracy, and information gain as intrinsic quality rewards, and employs iterative reinforcement learning to continuously optimize query suggestion quality, achieving a 6.81% online chatUV improvement.
ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning: This paper proposes ReRec, a reinforcement fine-tuning (RFT)-based recommendation assistant framework that addresses the limitations of coarse reward signals and unsupervised reasoning processes through three components: dual-graph enhanced reward shaping for fine-grained reward signals, reasoning-aware advantage estimation for step-level differentiated supervision, and an online curriculum scheduler for dynamic training difficulty adjustment. ReRec enables LLMs to handle complex multi-step reasoning recommendation queries and significantly outperforms existing methods on the RecBench+ benchmark.
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF: This paper proposes Reverse Constitutional AI (R-CAI), which inverts the principles of Constitutional AI into a "toxic constitution" and combines a critique-revision loop with a probability-clamped RLAIF mechanism to achieve automated, controllable, multi-dimensional adversarial toxic data synthesis. Probability clamping mitigates reward hacking-induced semantic degradation, improving semantic coherence by 15%.
Right at My Level: A Unified Multilingual Framework for Proficiency-Aware Text Simplification: This paper proposes Re-RIGHT, a framework that trains a 4B policy model via GRPO with a three-module reward (vocabulary coverage + semantic preservation + coherence) to accurately simplify text in English, Japanese, Korean, and Chinese according to learner proficiency levels (CEFR/JLPT/TOPIK/HSK), outperforming large models such as GPT-5.2 and Gemini 2.5.
RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization: RL-PLUS proposes a hybrid-policy optimization approach that addresses external data distribution mismatch via Multiple Importance Sampling (MIS) and guides models to learn low-probability but correct reasoning paths via an Exploration-Based Advantage Function (EAF), successfully overcoming the capability boundary collapse induced by RLVR and achieving SOTA on six mathematical reasoning benchmarks (average 53.4), with consistent cross-model improvements of up to 69.2%.
Savoir: Learning Social Savoir-Faire via Shapley-based Reward Attribution: This paper proposes Savoir, a cooperative game-theoretic social RL framework that combines expected utility (prospective evaluation of the strategic potential of utterances) and Shapley values (axiomatic fair credit assignment) to address the credit assignment problem in multi-turn dialogue. Savoir achieves state-of-the-art performance on the SOTOPIA benchmark with a 7B model (Goal 7.18 in the Hard setting), matching or surpassing GPT-4o and Claude-3.5-Sonnet, while large reasoning models (o1, DeepSeek-R1) systematically underperform on social tasks.
Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study: This work presents the first systematic study of scaling behaviors in LLM reinforcement learning post-training, revealing power-law relationships between performance and training resources across the Qwen2.5 family (0.5B–72B), with learning efficiency saturating as model scale increases.
Semantic-Space Exploration and Exploitation in RLVR for LLM Reasoning: This paper argues that the conventional token-level exploration–exploitation trade-off in RLVR is an artifact of the measurement space. It proposes to measure exploration and exploitation in the hidden-state semantic space via Effective Rank (ER) and its temporal derivatives (ERV/ERA), and on this basis designs VERL, a method that simultaneously improves both objectives, achieving gains of up to 21.4% on benchmarks such as Gaokao mathematics.
SpiralThinker: Latent Reasoning through an Iterative Process with Text-Latent Interleaving: This paper proposes SpiralThinker, a framework for implicit reasoning that performs iterative updates in the latent representation space interleaved with explicit text reasoning steps. A progressive alignment objective is introduced to ensure latent representations remain consistent with explicit reasoning throughout the iterative process. SpiralThinker surpasses all latent reasoning baselines on mathematical, logical, and commonsense reasoning tasks.
STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems: This paper proposes the STRIDE-ED framework, which achieves state-of-the-art performance in empathetic dialogue across multiple open-source LLMs by constructing a comprehensive empathy strategy system covering positive/neutral/negative emotions, designing task-aligned multi-stage cognitive CoT reasoning, and combining strategy-aware data refinement with a two-stage SFT+PPO training paradigm. The framework attains an emotion accuracy of 57.25% and BLEU-4 of 4.67.
Table Question Answering in the Era of Large Language Models: A Comprehensive Survey: This paper presents a comprehensive survey of Table Question Answering (TQA) research in the era of large language models. It systematically categorizes task settings along five dimensions (table format, question complexity, answer format, modality, and domain), organizes modeling approaches around five core challenges (table understanding, complex queries, large input handling, data heterogeneity, and knowledge integration), covers 277 papers, and provides forward-looking discussions on emerging directions such as reinforcement learning and interpretability.
The Stackelberg Speaker: Optimizing Persuasive Communication in Social Deduction Games: This paper models turn-based dialogue in social deduction games as a Stackelberg game, where the current player acts as the leader and optimizes the persuasive impact of utterances by measuring the response distribution of the next player. A Refiner model trained with GRPO achieves significant improvements over baselines across four game benchmarks including Werewolf and Avalon.
Understanding Generalization in Role-Playing Models via Information Theory: This paper proposes R-EMID, the first information-theoretic framework for quantifying performance degradation in role-playing models (RPMs) under user, character, and dialogue distribution shifts. By incorporating reasoning processes and Co-evolutionary Reinforcement Learning (CoRL), the framework enables accurate estimation of this metric. Key findings reveal that user shift poses the greatest generalization risk, and reinforcement learning is the only consistently effective training strategy.
UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning: This paper proposes UniCreative, a framework that unifies long-form (plan→write) and short-form (direct generation) creative writing modes through Adaptive Constraint Preference Optimization (ACPO) and an Adaptive Criteria Generative Reward Model (AC-GenRM), requiring neither SFT nor reference answers. The trained model exhibits emergent metacognitive ability to autonomously distinguish between task types.
SCRL: What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time: This paper proposes SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning framework that mitigates label noise amplification through selective positive pseudo-labels (filtering unreliable majorities via strict consensus criteria) and entropy-gated negative pseudo-labels (introducing negative supervision signals into TTRL for the first time to prune erroneous trajectories), achieving up to 10.1 percentage points improvement over TTRL on AIME25.
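
The selective positive pseudo-labeling idea in the SCRL entry above can be illustrated with a small sketch. This is not the paper's procedure: the function name `select_positive_pseudo_label` and both thresholds (`min_share`, `min_margin`) are hypothetical, chosen only to show how a stricter consensus rule can reject a weak majority that plain majority voting would accept.

```python
from collections import Counter


def select_positive_pseudo_label(answers, min_share=0.6, min_margin=0.2):
    """Return the majority answer only if consensus is strong, else None.

    Hypothetical rule (an assumption, not SCRL's actual criteria):
    - the top answer must cover at least `min_share` of all samples, and
    - it must lead the runner-up by at least `min_margin` of all samples.
    """
    if not answers:
        return None
    counts = Counter(answers)
    ranked = counts.most_common(2)
    top_answer, top_count = ranked[0]
    runner_up_count = ranked[1][1] if len(ranked) > 1 else 0
    total = len(answers)
    share = top_count / total
    margin = (top_count - runner_up_count) / total
    if share >= min_share and margin >= min_margin:
        return top_answer
    # Weak or contested majority: emit no positive pseudo-label.
    return None


strong = ["42"] * 7 + ["41"] * 2 + ["40"]
weak = ["42"] * 4 + ["41"] * 3 + ["40"] * 3
print(select_positive_pseudo_label(strong))
print(select_positive_pseudo_label(weak))
```

In the second call, "42" is still the plurality answer, but it is rejected because its share is too low. That is the kind of unreliable majority the entry describes filtering out.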

🔬 Interpretability¶

A Structured Clustering Approach for Inducing Media Narratives: This paper proposes a framework for automatically inducing media narrative patterns from large-scale news corpora. By jointly modeling causal event chains and character roles (hero/villain/victim), the framework employs a role-constrained clustering algorithm to organize narrative chains into semantically coherent narrative patterns. The approach generates interpretable narrative patterns consistent with framing theory in two domains: immigration and gun control.
Aligning What LLMs Do and Say: Towards Self-Consistent Explanations: This paper constructs a large-scale Post-hoc Self-Consistency Bank (PSCB, 85K decisions × 428K explanations), quantifies the feature attribution gap between LLM answers and their explanations, and improves attribution consistency of explanations via DPO optimization without sacrificing accuracy.
ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding: This paper proposes ChemVLR, the first reasoning-oriented VLM for the chemical domain. It constructs a 760K reasoning dataset via a cross-modal reverse engineering strategy and employs a three-stage training pipeline of continued pre-training → SFT → RL, achieving substantial improvements over proprietary models and domain-specialized VLMs on molecular recognition and reaction prediction tasks.
Context-Value-Action Architecture for Value-Driven Large Language Model Agents: This paper proposes the CVA (Context-Value-Action) architecture, grounded in the S-O-R psychological model and Schwartz's theory of basic human values. By training a Value Verifier on real human data, CVA decouples action generation from cognitive reasoning, effectively mitigating behavioral polarization in LLM agents. The approach achieves substantial improvements over baselines on CVABench, a benchmark comprising over 1.1 million real interaction trajectories.
Curing "Miracle Steps" in LLM Mathematical Reasoning with Rubric Rewards: This paper identifies a pervasive phenomenon in LLM mathematical reasoning termed "Miracle Steps"—instances where a reasoning chain leaps to the correct answer without valid derivation—and proposes the Rubric Reward Model (RRM), a problem-specific process reward function that reduces Miracle Steps by 71% during RL training and improves Verified Pass@1024 on AIME2024 from 26.7% to 62.6%.
Do LLMs Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations: This paper identifies and formalizes "structural alignment bias" in LLM tool invocations — the tendency of LLMs to invoke a tool when query attributes can be effectively mapped to tool parameters, even when the tool's functionality is irrelevant to the user's goal. The authors construct the SABEval dataset to decouple structural alignment from semantic relevance, apply contrastive attention attribution (CAA) to reveal two competing internal pathways (semantic checking vs. structural matching), and propose a path rebalancing strategy that achieves 80% relative error reduction.
Evian: Towards Explainable Visual Instruction-tuning Data Auditing: This paper proposes a Decomposition-then-Evaluation paradigm and the EVIAN framework, which decomposes responses in visual instruction tuning data into three components—visual description, subjective reasoning, and factual claims—and evaluates them along three orthogonal dimensions: image-text consistency, logical coherence, and factual accuracy. Models trained on the small high-quality subset selected by EVIAN outperform those trained on large-scale datasets.
Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models: This work constructs a controlled knowledge framework to systematically study how LLMs leverage experimental descriptions and outcome evidence in scientific feasibility assessment. Results show that providing outcome evidence is more reliable than experimental descriptions, that partial experimental information frequently degrades performance below a parametric-knowledge-only baseline, and that LLM reasoning exhibits notable fragility under incomplete evidence.
Forest Before Trees: Latent Superposition for Efficient Visual Reasoning: This paper proposes Laser, a framework that conducts visual reasoning in latent space via Dynamic Window Alignment Learning (DWAL), enabling the model to maintain a probabilistic "superposition state" over future semantics rather than performing precise per-token prediction. This realizes a "global-before-local" cognitive hierarchy, achieving state-of-the-art performance among latent reasoning methods on 6 benchmarks with only 6 reasoning tokens (a reduction of 97%+), surpassing Monet by an average of 5.03%.
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization: Through systematic mechanistic interpretability analysis, this paper reveals that LLM quantization exhibits two qualitatively distinct failure modes: 4-bit Signal Degradation (computational patterns remain intact but precision is impaired, amenable to local repair) and 2-bit Computation Collapse (functional destruction of critical components, requiring structural reconstruction).
HistLens: Mapping Idea Change across Concepts and Corpora: This paper proposes HistLens, a framework that leverages sparse autoencoders (SAEs) to decompose concept representations into interpretable semantic basis vectors, enabling the tracking of diachronic evolution trajectories across multiple concepts and corpora within a shared coordinate system. The framework supports implicit concept computation and provides a quantifiable, comparable analytical tool for digital humanities and conceptual history research.
IDEA: An Interpretable and Editable Decision-Making Framework for LLMs via Verbal-to-Numeric Calibration: This paper proposes IDEA, a framework that extracts LLM decision knowledge into an interpretable parametric model over semantic factors. An EM algorithm jointly learns the mapping from verbal probability expressions to numeric values and the decision parameters, enabling calibrated, editable, and interpretable LLM decision-making. IDEA with Qwen-3-32B achieves 78.6% average F1 across five datasets, surpassing DeepSeek R1 (68.1%) and GPT-5.2 (77.9%).
Interpretability from the Ground Up: Starting from the informational needs of educational assessment stakeholders, this paper proposes four FGTI principles (Faithful, Grounded, Traceable, Interchangeable) and develops the three-stage AnalyticScore framework for interpretable automated scoring, achieving an average QWK on ASAP-SAS only 0.06 below the non-interpretable SOTA.
Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation: By constructing a verifiable intermediate reasoning trace dataset via rule-based question decomposition, this paper reveals that the semantic correctness of CoT reasoning traces correlates unreliably with final answer accuracy (correct traces lead to correct answers only 28% of the time), and that the most interpretable traces are not the most performance-enhancing ones—verbose R1 traces achieve the best performance yet are rated the least interpretable by users.
LePREC: Reasoning as Classification over Structured Factors for Assessing Relevance of Legal Issues: This paper proposes LePREC, a neuro-symbolic framework inspired by legal professionals' analytical processes. It uses LLMs to generate reasoning question–answer pairs that convert unstructured legal text into structured features, which are then fed into a sparse linear model for relevance classification. On the LIC dataset constructed from 769 Malaysian contract law cases, LePREC achieves 30–40% improvement over LLM baselines such as GPT-4o.
LLM-Guided Semantic Bootstrapping for Interpretable Text Classification with Tsetlin Machines: This paper proposes an LLM-guided semantic bootstrapping framework that leverages LLMs to generate sub-intents and trains a Non-Negated Tsetlin Machine (NTM) via three-stage curriculum synthetic data generation. High-confidence symbolic features extracted by the NTM are injected into real data representations, enabling a standard TM to approach BERT-level classification performance while maintaining full interpretability.
NOSE: Neural Olfactory-Semantic Embedding with Tri-Modal Orthogonal Contrastive Learning: This paper proposes NOSE, a tri-modal olfactory representation learning framework that uses molecules as a pivot to align three modalities—molecular structure, receptor sequences, and natural language descriptions—via an orthogonal injection mechanism. Combined with an LLM-driven weak positive augmentation strategy to address description sparsity, NOSE achieves state-of-the-art performance on 11 downstream tasks and demonstrates strong zero-shot generalization.
PV-SQL: Synergizing Database Probing and Rule-based Verification for Text-to-SQL Agents: This paper proposes PV-SQL, an agent-based Text-to-SQL framework that combines two complementary components — Probe (iteratively generating probing queries to discover database value formats, column semantics, and table relationships) and Verify (extracting verifiable constraints via pattern matching and constructing a checklist) — achieving 5% higher execution accuracy and 20.8% higher valid efficiency score over the best baseline on the BIRD benchmark.
Reasoning Fails Where Step Flow Breaks: This paper proposes Step-Saliency, a diagnostic tool that identifies two depth-correlated information flow failure modes in large reasoning models (Shallow Lock-in and Deep Decay), and designs StepFlow, a test-time intervention that repairs information propagation and improves reasoning accuracy without retraining.
Revitalizing Black-Box Interpretability: Actionable Interpretability for LLMs via Proxy Models: This paper proposes a proxy-model-based black-box interpretability framework that leverages cheap small models to approximate the local decision boundaries of expensive large models for generating LIME/SHAP explanations. A statistical screen-and-apply mechanism ensures reliability: proxy explanations maintain over 90% fidelity while reducing costs by 88.2%, and are successfully applied to downstream optimization tasks such as prompt compression and poisoned sample removal.
Rhetorical Questions in LLM Representations: A Linear Probing Study: This work applies linear probing to analyze how LLMs internally represent rhetorical questions (RQs), finding that RQs are linearly separable in representation space and that probes transfer across datasets. However, probes trained on different datasets learn inconsistent directions, indicating that RQs are encoded along multiple heterogeneous linear directions rather than a single unified dimension.
Similarity-Distance-Magnitude Activations: This paper proposes SDM (Similarity-Distance-Magnitude) activations as a more robust replacement for softmax. By decoupling and integrating three epistemic dimensions—Similarity (deep matching with correct training predictions), Distance (proximity to the training distribution), and Magnitude (distance to the decision boundary)—into a novel activation \(\text{sdm}(\mathbf{z}')_i = (2+q)^{d \cdot z'_i} / \sum_c (2+q)^{d \cdot z'_c}\), the method constructs an SDM estimator for selective classification that is more robust than existing calibration approaches under covariate shift and out-of-distribution inputs.
SITE: Soft Head Selection for Injecting ICL-Derived Task Embeddings: SITE proposes a gradient-based soft attention head selection method that identifies task-relevant attention heads to effectively inject ICL-derived task embeddings. Across 12 LLMs (4B–70B), SITE substantially outperforms ICL and existing embedding injection methods while achieving performance comparable to PEFT with far fewer trainable parameters.
SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks: SPENCE detects and quantifies data contamination of LLMs on NL2SQL benchmarks by systematically generating syntactic paraphrases of benchmark queries and measuring the decay of execution accuracy as a function of syntactic distance. Older benchmarks (e.g., Spider) exhibit stronger contamination signals, while the more recent BIRD benchmark is largely unaffected.
StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference: This paper proposes StructKV, a structure-aware KV Cache compression framework that identifies globally important tokens via Global In-Degree Centrality accumulated across layers, adaptively locates the optimal compression layer via Dynamic Pivot Detection, and decouples computation and storage budgets via Structural Propagation & Decoupling. At 60% prefill + 10% KV retention, StructKV achieves near-full-context performance on LongBench and RULER.
Style over Story: Measuring LLM Narrative Preferences via Structured Selection: This paper proposes a constrained-selection experimental paradigm to measure LLM narrative preferences. Using a library of 200 constraints constructed from narratological theory, six LLMs are evaluated across different instruction types, revealing that models systematically favor "Style" over content elements such as "Event," "Character," and "Setting."
TabReX: Tabular Referenceless eXplainable Evaluation: This paper proposes TabReX, a graph-reasoning-based referenceless evaluation framework for tabular generation. It converts source text and generated tables into knowledge graph triples and aligns them to compute interpretable, attribute-driven scores. TabReX substantially outperforms existing methods in correlation with human judgments, and the authors also introduce TabReX-Bench, a large-scale evaluation benchmark.
The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination: This paper systematically reveals the "reasoning trap" paradox: enhancing LLM reasoning capabilities — whether through RL, distillation, or switchable reasoning modes — systematically amplifies tool hallucination. This effect is associated with reasoning itself rather than RL training, and existing mitigation strategies (prompt engineering, DPO) face an unavoidable reliability-capability trade-off.
ThreadSumm: Summarization of Nested Discourse Threads Using Tree of Thoughts: This paper proposes ThreadSumm, a multi-stage LLM pipeline framework that models nested discourse thread summarization as a hierarchical reasoning problem. It first extracts aspects and atomic content units (ACUs) for content planning, then constructs a thread-aware sequence via sentence ordering, and finally applies Tree of Thoughts search to generate and score multiple paragraph candidates. The approach outperforms baselines on Reddit and StackExchange datasets.
To Trust or Not to Trust: Attention-Based Trust Management for LLM Multi-Agent Systems: This paper proposes the first comprehensive definition of "trustworthiness" for LLM multi-agent systems (LLM-MAS), grounded in six orthogonal dimensions derived from Grice's Cooperative Principle. It demonstrates that LLM attention patterns can distinguish different types of trustworthiness violations, and on this basis introduces A-Trust, a lightweight attention-based evaluation method, and an end-to-end Trust Management System (TMS) that achieves malicious message detection rates of 77–90% across diverse attack scenarios.
Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures: This paper presents a systematic survey of recent advances in intrinsic interpretability of LLMs, organizing existing methods into five design paradigms (functional transparency, concept alignment, representational decomposability, explicit modularity, and latent sparsity induction), and discusses open challenges and future directions.
Tracing Relational Knowledge Recall in Large Language Models: This paper systematically investigates the internal mechanisms by which LLMs recall relational knowledge during text generation. It finds that per-head attention contributions to the residual stream (\(\Delta_{att,h}\)) serve as the strongest features for linear relation classification (91% accuracy), and proposes two probe attribution methods—HeadScore and TokenScore—to decompose predictions to the attention head and source token levels, revealing clear correlations between probe accuracy and relation specificity, entity connectivity, and probe signal concentration.
Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis and Interpretation: This paper systematically analyzes factual hallucinations induced by new-knowledge learning during SFT using a controlled synthetic dataset, Biography-Reasoning. It identifies the root mechanism as the attenuation of attention to key entities, and proposes KnownPatch—injecting a small amount of known-knowledge samples at the end of training to restore attention patterns—effectively mitigating hallucinations.
Understanding or Memorizing? A Case Study of German Definite Articles in Language Models: This paper employs the Gradiend gradient-based interpretability method to investigate whether language models predict German definite articles (der/die/das/den/dem/des) by leveraging abstract grammatical rules or surface-level memorization, finding that models rely at least partially on memorized associations rather than strict rule-based encoding.
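
The SDM activation formula quoted in the Similarity-Distance-Magnitude Activations entry of this section can be evaluated directly. The sketch below is a minimal illustration under stated assumptions: `q` and `d` are taken as given non-negative scalars (how the paper estimates them is not covered here), and the function names are illustrative. It also checks one property that follows from the formula itself: with `q = e - 2` and `d = 1`, the base becomes `e` and the expression reduces to the ordinary softmax.

```python
import math


def sdm_activation(z, q, d):
    """Compute (2+q)^(d*z_i) / sum_c (2+q)^(d*z_c) for a list of logits z.

    Assumes q >= 0 so that the base (2 + q) is positive.
    """
    log_base = math.log(2.0 + q)
    # (2+q)^(d*z_i) == exp(d * z_i * log(2+q)); subtract the max for stability.
    scaled = [d * z_i * log_base for z_i in z]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softmax(z):
    """Standard softmax, used here only as a reference point."""
    shift = max(z)
    exps = [math.exp(z_i - shift) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 3.0]
same_as_softmax = sdm_activation(logits, q=math.e - 2.0, d=1.0)
flat = sdm_activation(logits, q=1.0, d=0.0)   # d = 0 gives a uniform output
sharp = sdm_activation(logits, q=3.0, d=1.0)  # larger base sharpens the output
print(same_as_softmax, softmax(logits))
print(flat)
print(sharp)
```

Under these assumptions, `d` scales the logits (a value of 0 flattens the output to uniform) and `q` sets the exponential base, so larger values of either sharpen the distribution.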

🎵 Audio & Speech¶

Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations: This paper proposes Affectron, a framework that achieves diverse and emotionally aligned nonverbal vocalization (NV) synthesis—such as laughter and sighs—on small-scale open-source disentangled corpora, via two training-time augmentation strategies: emotion-driven Top-K NV matching and emotion-aware Top-K routing. The proposed method substantially outperforms the purely language-pretrained VoiceCraft baseline.
Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs: Alexandria constructs a parallel English-Dialectal Arabic multi-turn dialogue dataset covering 13 Arab countries, 11 social impact domains, and 107K turns. Through a community-driven manual translation and revision process, it provides fine-grained training and evaluation resources for dialectal Arabic machine translation, accompanied by a systematic benchmark assessment across 24 LLMs.
An Exploration of Mamba for Speech Self-Supervised Models: This work presents the first comprehensive exploration of the Mamba architecture as a backbone for speech self-supervised learning (SSL), demonstrating that Mamba-based HuBERT outperforms Transformers in long-context ASR, streaming ASR, and causal probing tasks while maintaining linear time complexity.
Anchored Cyclic Generation: A Novel Paradigm for Long-Sequence Symbolic Music Generation: This paper proposes the Anchored Cyclic Generation (ACG) paradigm, which calibrates the generation direction by using confirmed musical content as anchors during autoregressive decoding, effectively mitigating error accumulation in long-sequence symbolic music generation. A hierarchical framework, Hi-ACG, is further constructed to realize global-to-local music generation.
Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs: This paper identifies that the perceptual weakness of current AudioLLMs stems from an ASR-centric training paradigm that systematically suppresses paralinguistic and non-linguistic information. It proposes the Unified Audio Schema (UAS), which structures audio information into a three-dimensional JSON format covering transcription, paralinguistics, and non-linguistic events. The approach achieves a 10.9% improvement in perceptual accuracy on the MMSU benchmark while preserving reasoning capabilities.
Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models: This paper proposes AHD (Anchor-based History-stable Decoding), a training-free plug-and-play dynamic decoding strategy that identifies cross-block stable tokens in diffusion LLMs by tracing historical trajectories via dynamic anchors, enabling early unlocking. On BBH, AHD reduces decoding steps by 80% while improving performance by 3.67%.
Computational Narrative Understanding for Expressive Text-to-Speech: This paper extracts character direct speech from fiction audiobooks to construct a large-scale expressive speech dataset, LibriQuote (5.3K hours of quotations + 12.7K hours of narration), annotating speaking style with speech verb and adverb pseudo-labels derived from narrative context. Experiments demonstrate that fine-tuning a flow-matching model simultaneously improves expressiveness and intelligibility, and that LibriQuote-test constitutes a challenging benchmark for expressive TTS.
DIA-HARM: Dialectal Disparities in Harmful Content Detection Across 50 English Dialects: This paper introduces DIA-HARM, the first benchmark for evaluating the robustness of misinformation detectors across 50 English dialects. It reveals that human-authored dialectal content causes detection performance drops of 1.4–3.6% F1, that fine-tuned Transformers substantially outperform zero-shot LLMs (96.6% vs. 78.3%), and that some models suffer catastrophic degradation exceeding 33% on mixed-content inputs.
Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models: This paper employs layer-wise oracle intervention experiments to reveal a structured redundancy hierarchy in the speech token representations of large speech language models (LSLMs)—whereby shallow layers encode essential acoustic details while deep layers are highly redundant—and proposes Affinity Pooling, a training-free similarity-based token merging mechanism that reduces FLOPs by 27.48% while maintaining competitive accuracy.
SEPT: Semantically Expanded Prompt Tuning for Audio-Language Models: SEPT leverages LLMs to generate semantic neighbors for each category and introduces a margin-constrained semantic expansion loss to regularize the prompt embedding space, substantially alleviating the Base-New Tradeoff (BNT) in prompt tuning for audio-language models (ALMs). It also establishes the first systematic evaluation benchmark for prompt generalization in ALMs.
HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models: This paper presents HalluAudio, the first large-scale cross-domain (speech/environmental sound/music) benchmark for hallucination detection in large audio-language models (LALMs), comprising 5,000+ human-verified QA pairs and a systematic adversarial prompt design. It evaluates mainstream LALMs across multiple dimensions (accuracy, hallucination rate, Yes-No bias, rejection rate, and error type), revealing significant deficiencies in acoustic grounding, temporal reasoning, and music attribute understanding.
Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages: This paper presents a phoneme-level ASR analysis of two extremely phonologically complex, low-resource endangered East Caucasian languages (Archi and Rutul), finding that phoneme recognition accuracy follows an S-shaped learning curve with respect to training frequency, and that many errors attributed to phonological complexity are in fact primarily caused by data scarcity.
How Hypocritical Is Your LLM Judge? Listener–Speaker Asymmetries in the Pragmatic Competence of Large Language Models: This paper systematically compares 14 LLMs as pragmatic listeners (judging pragmatic appropriateness) and pragmatic speakers (generating pragmatically appropriate language) across three tasks—false presuppositions, antipresuppositions, and deductive reasoning—revealing pervasive listener–speaker asymmetries: most models perform substantially better as judges than as generators, and item-level analysis shows that correct judgments do not reliably predict successful generation.
Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering: This paper introduces Jamendo-MT-QA, a multi-track comparative music question answering benchmark comprising 36,519 QA pairs across 12,173 track pairs. It is the first systematic evaluation of audio-language models on cross-track comparative reasoning, revealing significant deficiencies in sentence-level comparative generation among existing models.
Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective: This paper proposes CmIR (Causal modality Invariant Representation learning), which leverages causal inference theory to explicitly disentangle each modality into a causal invariant representation and an environment-specific spurious representation. Through an objective combining invariance constraints, mutual information constraints, and reconstruction constraints, the framework ensures that invariant representations maintain stable predictive relationships across environments. CmIR achieves state-of-the-art performance on multimodal sentiment, humor, and sarcasm detection, with particularly strong results under OOD and noisy conditions.
MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus: This paper introduces MCGA, the first large-scale (119 hours, 22,000 samples) fully copyright-cleared audio corpus for classical Chinese literature, spanning five major literary genres (Fu, Shi, Wen, Ci, Qu) and six speech tasks (ASR/S2TT/SEC/SQA/SU/SR). An evaluation of 10 multimodal large models reveals substantial deficiencies in current models' ability to understand classical literary speech.
Multimodal In-Context Learning for ASR of Low-Resource Languages: This paper systematically investigates whether multimodal in-context learning (MICL) enables speech LLMs to handle unseen endangered languages, and proposes a MICL-based hypothesis selection system that combines the complementary strengths of acoustic models and speech LLMs, achieving substantial ASR improvements across three endangered languages.
Music Audio-Visual Question Answering Requires Specialized Multimodal Designs: As the first comprehensive survey of the Music Audio-Visual Question Answering (Music AVQA) field, this paper systematically analyzes dataset evolution and method design, demonstrating that specialized input processing, spatiotemporal architectural design, and music domain knowledge are essential for this task, and that general-purpose multimodal models are insufficient to address the unique challenges of music performance understanding.
MSU-Bench: Musical Score Understanding Benchmark: MSU-Bench is the first human-annotated benchmark for complete musical score understanding, comprising 1,800 generative QA pairs from 150 pieces across four difficulty levels. Evaluation reveals severe localization failures and hallucination in LLMs/VLMs, while text-based ABC notation input substantially mitigates these issues.
Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition: This paper proposes Pseudo2Real, a parameter-space correction method that computes a "correction vector" as the weight difference between a real-label model and a pseudo-label model trained on a source domain, then applies this vector to a pseudo-label fine-tuned model on the target domain to rectify systematic pseudo-label bias. The method achieves up to 35% relative WER reduction across ten African accents in AfriSpeech-200.
Retrieving to Recover: Towards Incomplete Audio-Visual Question Answering via Semantic-consistent Purification: This paper proposes the R2ScP framework, which shifts the missing-modality paradigm in AVQA from conventional generative completion to retrieval-based recovery. By combining cross-modal retrieval with a context-aware adaptive purification mechanism to eliminate retrieval noise, R2ScP achieves substantial performance gains in modality-incomplete settings.
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction: This paper proposes an instruction-referencing-based defense against prompt injection attacks. Rather than suppressing the LLM's instruction-following capability, the method instructs the model to reference the executed instruction within its response, and then removes responses unrelated to the original instruction via label filtering, reducing the attack success rate to near 0% in several settings.
SpeakerSleuth: Can Large Audio-Language Models Judge Speaker Consistency across Multi-turn Dialogues?: SpeakerSleuth introduces the first benchmark (1,818 instances) for evaluating LALMs' ability to judge speaker consistency in multi-turn dialogues. Systematic evaluation of 12 LALMs and 6 embedding methods reveals that models struggle to detect and localize acoustic inconsistencies and exhibit severe text-over-acoustics modality bias, yet perform relatively well on comparative/ranking tasks involving acoustic variants.
Splits! Flexible Sociocultural Linguistic Investigation at Scale: This paper proposes a methodology for constructing a sociolinguistic "sandbox," building Splits!—a 9.7 million post dataset from Reddit partitioned along two axes (demographic group × discussion topic) across 6 groups and 89 topics—and designing a two-stage filtering pipeline based on lift and triviality to efficiently identify non-trivial, research-worthy sociocultural linguistic phenomena from 23,000 LLM-generated candidate hypotheses.
Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions: To address the inability of voice assistants to distinguish third-party interruptions (TPI) from primary-user speech, this work proposes TPI-Train, a dataset of 88K training instances, along with the TPI-Bench evaluation framework. A speaker-aware hard negative mining strategy is introduced to eliminate semantic shortcut learning, enabling models to rely genuinely on acoustic cues for interruption detection.
StressTest: Can YOUR Speech LM Handle the Stress?: This paper proposes StressTest, a benchmark for evaluating the ability of speech language models (SLMs) to understand the meaning conveyed by sentence stress. Evaluations reveal that existing models are nearly incapable of inferring speaker intent from stress patterns. A synthetic data pipeline, Stress-17k, is introduced, and the resulting fine-tuned model, StresSLM, substantially outperforms state-of-the-art models on both stress detection and stress reasoning tasks.
TellWhisper: Tell Whisper Who Speaks When: This paper proposes TellWhisper, which jointly encodes speaker identity and temporal information into the speech encoder's self-attention via a time-speaker-aware rotary position encoding (TS-RoPE), coupled with a hyperbolic speaker diarization model (Hyper-SD), to achieve joint modeling of "who speaks what when" and attain state-of-the-art performance on multi-speaker ASR tasks.
Temporal Contrastive Decoding: A Training-Free Method for Large Audio-Language Models: This paper proposes TCD, a training-free inference-time decoding method that contrasts logits from the original audio path against a temporally blurred slow path, combined with stability-guided blur window selection and uncertainty-based gating, to help unified audio-language models better exploit transient acoustic cues. Consistent improvements are demonstrated on MMAU and AIR-Bench.
Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training: This paper proposes the FCaps large-scale dataset (47k hours of speech, 19M fine-grained annotations) and the CLSP contrastive learning model. Through an end-to-end annotation pipeline and fine-grained multi-granular contrastive supervision, it presents the first speech-text alignment model capable of uniformly representing both global and fine-grained speaking styles.
When Misinformation Speaks and Converses: Rethinking Fact-Checking in Audio Platforms: This position paper argues that misinformation on audio platforms is fundamentally distinct from textual misinformation in two dimensions: it is simultaneously spoken (conveying persuasion through prosody, pacing, and emotion) and conversational (unfolding across multiple turns, speakers, and episodes). Existing text-centric fact-checking pipelines cannot adequately handle these properties, and verification frameworks must be redesigned around the intrinsic characteristics of audio.
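
The Pseudo2Real entry above describes a correction vector computed in parameter space. A minimal sketch of that arithmetic follows, assuming model weights are held as plain dictionaries of float lists. The function names and the scaling factor `alpha` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
def correction_vector(real_weights, pseudo_weights):
    """Per-parameter difference on the source domain:
    real-label model weights minus pseudo-label model weights."""
    return {
        name: [r - p for r, p in zip(real_weights[name], pseudo_weights[name])]
        for name in real_weights
    }


def apply_correction(target_pseudo_weights, vector, alpha=1.0):
    """Add the (optionally scaled) correction vector to a model that was
    fine-tuned on pseudo-labels in the target domain.
    `alpha` is an illustrative assumption, not a documented parameter."""
    return {
        name: [w + alpha * v for w, v in zip(target_pseudo_weights[name], vector[name])]
        for name in target_pseudo_weights
    }


# Toy weights with a single parameter tensor named "w".
source_real = {"w": [1.0, 2.0]}
source_pseudo = {"w": [0.5, 2.5]}
target_pseudo = {"w": [0.0, 1.0]}

vector = correction_vector(source_real, source_pseudo)
corrected = apply_correction(target_pseudo, vector)
```

In practice the same element-wise arithmetic would run over framework tensors rather than Python lists; the dictionary layout here only keeps the sketch dependency-free.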

LLM Safety¶

Adaptive Text Anonymization: Learning Privacy-Utility Trade-offs via Prompt Optimization: This paper proposes an adaptive text anonymization framework that employs evolutionary prompt optimization to automatically discover task-specific anonymization instructions for LLMs, outperforming manually designed strategies across multiple privacy-utility trade-off scenarios while operating entirely on open-source models.
AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation: AGSC proposes an uncertainty quantification framework for long-text generation that uses NLI neutral probability to trigger adaptive granularity decomposition (reducing inference time by 60%) and employs GMM soft clustering to capture latent semantic topics for topic-aware weighted aggregation, achieving state-of-the-art factuality correlation on the BIO and LongFact benchmarks.
Beyond End-to-End: Dynamic Chain Optimization for Private LLM Adaptation on the Edge: This paper proposes ChainFed, a chain-based federated fine-tuning paradigm that breaks through the memory wall by sequentially training and freezing adapters layer by layer, enabling resource-constrained edge devices to participate in LLM fine-tuning. Combined with three techniques—Dynamic Layer Coordination, Global-aware Parameter Optimization, and Function-Oriented Adaptive Tuning—ChainFed achieves up to 46.46% average accuracy improvement.
Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors: This paper proposes STEEREDIT, a backdoor injection framework that compiles dynamic activation steering into static weight modifications. By extracting a compliance direction and applying null-space constraints, the injected backdoor activates only in the presence of a trigger token. The method achieves high attack success rates on multiple safety-aligned LLMs while preserving safe behavior and general capability in trigger-absent scenarios.
De-Anonymization at Scale via Tournament-Style Attribution: This paper proposes DAS (De-Anonymization at Scale), an LLM-based large-scale authorship de-anonymization method that combines tournament-style elimination, dense retrieval pre-filtering, and multi-round voting aggregation to perform author matching across tens of thousands of candidate texts, revealing the privacy threat that LLMs pose to anonymous platforms such as double-blind peer review.
DUET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode: This paper proposes DUET, a dual-path framework that combines direct code execution with LLM-based pseudocode execution. The two paths are complementary—the former is reliable when generated code is correct but vulnerable to implementation errors, while the latter bypasses implementation details at the cost of potential execution hallucinations. Predictions are merged via functional majority voting, achieving a 13.6 percentage-point improvement in Pass@1 on LiveCodeBench test output prediction.
Enhancing Hallucination Detection via Future Context: This paper proposes leveraging sampled "future context" (subsequent sentences) to enhance hallucination detection in black-box settings. By exploiting the "snowball effect"—whereby hallucinations tend to propagate once introduced—the method consistently improves detection performance across multiple sampling-based approaches, including SelfCheckGPT and SC.
FACTS: Table Summarization via Offline Template Generation with Agentic Workflows: This paper proposes FACTS (Fast, Accurate, and Privacy-Compliant Table Summarization), a three-stage agentic workflow that automatically generates reusable offline templates (SQL queries + Jinja2 templates) for fast, accurate, and privacy-compliant query-focused table summarization, achieving state-of-the-art performance across FeTaQA, QTSumm, and QFMTS benchmarks.
Forget What Matters, Keep the Rest: Selective Unlearning of Informative Tokens: This paper proposes Entropy-guided Token Weighting (ETW), which uses the entropy of the predictive distribution as a proxy for token informativeness. ETW selectively imposes stronger unlearning penalties on informative tokens, enabling effective removal of target knowledge while better preserving general model utility.
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages: This paper introduces ICF, the first multi-Indic-language CodecFake detection benchmark, and proposes SATYAM—a hyperbolic audio large language model that aligns semantic and paralinguistic representations via Bhattacharyya distance in hyperbolic space before aligning with a conditioning prompt. With only 3.75M trainable parameters, SATYAM achieves 98.32% detection accuracy.
Jailbreaking Large Language Models with Morality Attacks: This paper constructs a 10.3K morality attack dataset (covering value ambiguity and value conflict scenarios) and manipulates the moral judgment of LLMs via four adversarial strategies. It finds that both LLMs and guardrail models are highly vulnerable to morality attacks, and that larger models are paradoxically easier to compromise.
KoCo: Conditioning Language Model Pre-training on Knowledge Coordinates: This paper proposes Knowledge Coordinate-conditioned pre-training (KoCo), which maps each document to a three-dimensional semantic coordinate (Source, Content, Stability) and injects this as a natural language prefix during pre-training. This endows the model with explicit context-awareness, yielding performance gains across 10 downstream tasks, approximately 30% faster convergence, and effective hallucination mitigation.
Look Twice before You Leap: A Rational Framework for Localized Adversarial Anonymization: This paper proposes the RLAA framework, which addresses the utility collapse problem when transferring adversarial text anonymization to local small models (LSMs). Through an Attacker-Arbitrator-Anonymizer (A-A-A) architecture and a Marginal Rate of Substitution (MRS) rationality constraint, RLAA achieves a superior privacy-utility balance over API-based solutions on local devices, without any training.
Maximizing Local Entropy Where It Matters: Prefix-Aware Localized LLM Unlearning: This paper proposes PALU (Prefix-Aware Localized Unlearning), which achieves localized entropy maximization for unlearning along two dimensions: temporally, unlearning objectives are applied only to sensitive prefix tokens; in the vocabulary dimension, only top-K logits are flattened. This approach enables effective unlearning with minimal parameter perturbation while preserving the model's general capabilities.
MeasHalu: Mitigation of Scientific Measurement Hallucinations for LLMs: This paper proposes MeasHalu, a framework that mitigates hallucinations in LLM-based scientific measurement extraction through a fine-grained measurement hallucination taxonomy and a two-stage optimization pipeline (reasoning-aware SFT + hallucination-targeted GRPO rewards), achieving significant improvements over baselines on MeasEval.
Protecting Bystander Privacy via Selective Hearing in Audio LLMs: This work introduces SH-Bench, the first benchmark for bystander privacy evaluation, and proposes Bystander Privacy Fine-Tuning (BPFT), a method that improves the ability of audio LLMs to focus exclusively on the target speaker and refuse to disclose bystander information in multi-speaker environments. After BPFT, the SE metric surpasses Gemini 2.5 Pro by 16%.
Rethinking LLM Watermark Detection in Black-Box Settings: A Non-Intrusive Third-Party Framework: This paper proposes TTP-Detect, the first black-box third-party watermark verification framework that decouples detection from injection. By leveraging a proxy model to amplify watermark signals and combining three complementary metrics — local consistency, global geometry, and adaptive rank tests — it achieves high-accuracy detection across diverse watermarking schemes without access to secret keys or internal model states.
Synthia: Scalable Grounded Persona Generation from Social Media Data: This paper proposes Synthia, a framework that generates grounded LLM persona narratives from real social media posts (Bluesky), achieving up to 11.6% improvement over the state of the art on social survey alignment while using smaller models, and preserving social network topology to support network-aware analysis.
Topic-Based Watermarks for Large Language Models: This paper proposes TBW, a lightweight topic-based watermarking scheme that clusters the vocabulary into semantically coherent "green lists" via predefined topics (rather than random partitioning), selects the topic list most aligned with the input prompt for logit bias injection, and achieves text quality comparable to unwatermarked outputs while significantly improving robustness against paraphrase and lexical perturbation attacks.
Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement: This paper provides a theoretical analysis of how Multi-Token Prediction (MTP) induces representational contractiveness through gradient coupling mechanisms to promote the emergence of belief states. It simultaneously reveals a "structural hallucination" problem in MTP—namely, illegal shortcuts in the latent space—and proposes the LSE-MTP framework, which anchors predictions to true latent state trajectories via latent consistency loss and semantic anchoring loss. The approach significantly improves path legality and robustness on synthetic graphs and real-world Manhattan taxi navigation tasks.
Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations: This paper identifies two distinct information pathways through which LLMs internally encode truthfulness signals: Question-Anchored (relying on information flow from question to answer) and Answer-Anchored (extracting self-contained evidence from the generated answer itself). Both pathways are closely associated with knowledge boundaries. Building on this finding, the paper proposes two pathway-aware hallucination detection methods—Mixture-of-Probes and Pathway Reweighting—achieving AUC improvements of up to 10%.
Who Gets Which Message? Auditing Demographic Bias in LLM-Generated Targeted Text: This paper presents the first systematic analysis of demographic bias in LLM-generated targeted messages, proposes the Persuasion Bias Index (PBI), and finds that GPT-4o, Llama, and Mistral consistently employ stronger persuasive strategies toward male and younger audiences in climate communication, with contextual prompting systematically amplifying these disparities.
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models: This paper presents the first systematic study of the Incomplete Learning Phenomenon (ILP) in SFT — i.e., the model's inability to correctly reproduce a subset of training samples even after convergence. Five recurring causes are identified (knowledge absence, knowledge conflict, intra-dataset contradiction, left-side forgetting, and insufficient optimization), along with a diagnostic framework and targeted mitigation strategies.
XMark: Reliable Multi-Bit Watermarking for LLM-Generated Texts: This paper proposes XMark, a multi-bit text watermarking method based on the Leave-one-Shard-out (LoSo) strategy and evergreen lists. By taking the intersection of green lists across multiple vocabulary permutations and employing a constrained token-shard mapping matrix, XMark significantly improves decoding accuracy under limited token budgets while preserving text quality.
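
The topic-based watermarking entry above (TBW) selects a topic "green list" aligned with the prompt and biases logits toward it. The sketch below illustrates that flow under loud assumptions: alignment is measured by simple word overlap, logits are a token-to-score dictionary, and `delta` is a hypothetical bias strength. The paper's actual alignment measure and parameters may differ.

```python
def select_topic(prompt_tokens, topic_lists):
    """Pick the topic whose green list shares the most tokens with the prompt.
    Word overlap is an illustrative stand-in for the alignment measure."""
    prompt_set = set(prompt_tokens)
    return max(topic_lists, key=lambda topic: len(prompt_set & topic_lists[topic]))


def bias_logits(logits, green_list, delta=2.0):
    """Add `delta` to the score of every token in the chosen green list.
    `delta` is a hypothetical value, not taken from the paper."""
    return {
        token: score + (delta if token in green_list else 0.0)
        for token, score in logits.items()
    }


# Toy vocabulary partitioned by two predefined topics.
topic_lists = {
    "sports": {"goal", "team", "match"},
    "finance": {"stock", "bank", "market"},
}
prompt = ["the", "stock", "market", "fell"]

topic = select_topic(prompt, topic_lists)
biased = bias_logits({"bank": 1.0, "team": 1.0}, topic_lists[topic])
```

A detector following the same idea would recompute the topic from the prompt and count how many generated tokens fall in that green list.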

🌐 Multilingual & Translation¶

A Multilingual Dataset and Empirical Validation for the Mutual Reinforcement Effect in Information Extraction: This work constructs the first multilingual MRE Mix dataset (MMM, 21 subsets covering English, Chinese, and Japanese) and systematically validates through large-scale ablation experiments that the Mutual Reinforcement Effect (MRE) between word-level and text-level information extraction tasks exists universally across languages.
Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation: This paper constructs MENT, a non-literal translation meta-evaluation dataset comprising 7,530 human-annotated instances, reveals the unreliability of traditional metrics and LLM-as-Judge approaches on non-literal translation evaluation, and proposes RATE, an agentic evaluation framework in which a reflective Core Agent dynamically invokes sub-agents to improve correlation with human judgments by 3.2+ points.
BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources: This is the first unified survey dedicated to Indian language NLP resources, covering 200+ datasets, 50+ benchmarks, and 100+ models/tools. Resources are organized under 17 task categories spanning core linguistic processing to sociocultural tasks. The survey systematically analyzes persistent challenges including uneven language coverage, annotation fragmentation, and evaluation inconsistency.
Efficient Training for Cross-lingual Speech Language Models: This paper proposes CSLM, a data-efficient method for training cross-lingual speech LLMs. It introduces a novel alignment strategy to achieve cross-modal and cross-lingual alignment simultaneously, and presents a speech-text interleaved chain-of-modality generation paradigm to improve quality and reduce latency—without requiring large-scale speech data to extend to new languages.
Exploring Two-Phase Continual Instruction Fine-tuning for Multilingual Adaptation in Large Language Models: This paper proposes a two-phase continual fine-tuning (CFT) framework—first fine-tuning on English instruction data, then on multilingual data—and finds that instruction similarity between the two phases is the key factor determining whether English capability degrades. Generative replay and heuristic layer freezing are shown to effectively mitigate representation drift and English forgetting caused by dissimilar datasets.
IndoTabVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents: This paper presents IndoTabVQA, a cross-lingual visual question answering benchmark for table understanding in Bahasa Indonesia documents. The dataset comprises 1,593 document images annotated with QA pairs in four languages (Indonesian, English, Hindi, and Arabic). The benchmark reveals substantial performance gaps in VLMs for low-resource languages and cross-lingual table understanding, with fine-tuning combined with spatial priors achieving up to 48.5% In-Match accuracy.
Just Use XML: Revisiting Joint Translation and Label Projection: This paper proposes LabelPigeon, a joint translation and label projection method based on XML markup. By fine-tuning the NLLB-200 translation model on high-quality XML-annotated parallel corpora, LabelPigeon surpasses all baselines across 11 languages while actively improving translation quality, achieving gains of up to +40.2 F1 on downstream cross-lingual NER tasks.
Language Models Entangle Language and Culture: This paper evaluates multilingual LLMs on culturally neutral, open-ended advice-seeking questions derived from the WildChat dataset. It finds that query language systematically affects both response quality and cultural context — low-resource language queries yield notably lower quality responses than English, and language choice implicitly shifts the cultural framing of responses. A translated version of CulturalBench further validates the entanglement between language and culture in LLMs.
Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality: This paper proposes XBridge, an architecture that composes pretrained multilingual encoder-decoder translation models (e.g., NLLB) with English-centric LLMs — the encoder handles multilingual understanding, the LLM handles knowledge reasoning, and the decoder handles multilingual generation. Lightweight mapping layers and optimal transport alignment are employed to bridge cross-model semantic gaps, yielding significant improvements over baselines on low-resource and unseen languages.
Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs: This paper introduces LocQA, a benchmark comprising 2,156 location-sensitive QA pairs across 12 languages and 49 regions. By employing geographically ambiguous queries (e.g., "What is the emergency phone number?"), it exposes implicit biases in LLMs: a persistent US-centric default across languages (50% of model responses contain US answers vs. only 26% in the data), a within-language "demographic probability engine" effect driven by population size, and an exacerbation of global bias following instruction fine-tuning.
Lost in Translation: Do LVLM Judges Generalize Across Languages?: This paper introduces MM-JudgeBench, the first large-scale multilingual multimodal judge benchmark (25 languages, 60K+ preference instances), evaluating 22 LVLMs and revealing significant cross-lingual performance disparities in current LVLM judges. Model size and architecture cannot predict multilingual robustness, and even state-of-the-art judges exhibit inconsistent behavior, underscoring the necessity of multilingual multimodal evaluation benchmarks.
LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation: This paper proposes LQM (Linguistically Motivated Multidimensional Quality Metrics), a six-tier linguistically motivated MT error taxonomy spanning sociolinguistics → pragmatics → semantics → morphosyntax → orthography → graphetics. A bidirectional parallel corpus of 3,850 sentences across seven Arabic dialects is constructed, and 6,113 expert-annotated error spans are produced to reveal systematic deficiencies of existing MT systems in dialect-aware and culturally sensitive translation.
Mitigating Extrinsic Gender Bias for Bangla Classification Tasks: To address extrinsic gender bias in pretrained language models applied to Bangla downstream classification tasks, this paper proposes RandSymKL, a method that jointly optimizes randomized cross-entropy loss and symmetric KL divergence to effectively reduce gender prediction disparities while maintaining classification accuracy.
MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation: This paper introduces MORPHOGEN, a large-scale gender-aware morphological generation benchmark covering French, Arabic, and Hindi (20,328 sentence pairs in total). It defines the GENFORM task (rewriting first-person sentences into the opposite gender), proposes three evaluation metrics—SGA, GIoU, and CGA—and benchmarks 15 multilingual LLMs, revealing systematic deficiencies in complex morphological reasoning, gender bias, and multi-entity interference.
Multilingual Language Models Encode Script Over Linguistic Structure: This paper systematically analyzes language-associated units in multilingual LMs using the LAPE metric and sparse autoencoders, finding that these units are primarily driven by orthography (writing system) rather than abstract linguistic structure. Romanization activates nearly entirely disjoint sets of neurons; word-order shuffling has minimal effect; typological information becomes accessible only gradually in deeper layers; and causal interventions reveal that functional importance correlates with surface-form invariance.
No One Fits All: From Fixed Prompting to Learned Routing in Multilingual LLMs: This paper demonstrates that no single prompting strategy is universally optimal across all languages and tasks. It proposes to model strategy selection as a learned decision problem, using a lightweight classifier to predict the optimal strategy for each instance, achieving significant improvements over fixed strategies on four benchmarks.
Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition: This paper proposes NOVA-ARC, the first framework to formulate multilingual speech emotion recognition (SER) as an unsupervised transfer problem from labeled non-verbal vocalizations (NVV) to unlabeled verbal speech (UVS). By leveraging a hyperbolic prosody vector-quantized codebook, a Hyperbolic Emotion Lens, and optimal transport prototype alignment, NOVA-ARC achieves cross-modal emotion transfer and validates the feasibility and superiority of NVV→UVS transfer across 6 datasets.
SERM: Self-Evolving Relevance Model with Agent-Driven Learning from Massive Query Streams: This paper proposes the SERM framework, which continuously self-evolves a search relevance model from large-scale real-world query streams via a multi-agent sample miner and a multi-agent relevance annotator. After three iterative rounds on an industrial search platform, SERM achieves an NDCG@1 improvement of +2.99 and significantly improves user retention in online A/B testing.
Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation: This paper is the first to explore Universal Dependencies (UD) syntactic information as an augmentation source for in-context learning (ICL) in low-resource Coptic-to-English machine translation. While syntactic information alone is less effective than a bilingual lexicon, combining lexicon with syntactic information (LEX+SYN) achieves the best results across all tested models, with Gemma-27B reaching a BERTScore F1 of 0.8746 (+0.0361).
The GaoYao Benchmark: A Comprehensive Framework for Evaluating Multilingual and Multicultural Abilities of Large Language Models: This paper presents the GaoYao benchmark, comprising 182.3K samples across 26 languages and 51 countries/regions. Through a three-tier cultural evaluation framework (general multilingual / cross-cultural / mono-cultural) and nine cognitive sub-layers, combined with a human-localized subjective test set and an expert-validated cross-cultural synthetic dataset SuperBLEnD, GaoYao performs in-depth diagnosis of 20+ flagship and compact LLMs, revealing pronounced geographic digital divides and task-level capability stratification.
Unlocking the Edge: Multi-LoRA On-Device Deployment and Acceleration: This paper presents an on-device LLM deployment framework for Samsung Galaxy S24/S25, achieving dynamic task switching by treating LoRA weights as runtime inputs, reducing style-variant generation latency by 6× via multi-stream concurrent token generation, and accelerating decoding by 2.3× through draft-model-free Dynamic Self-Speculative Decoding—yielding an overall 4–6× optimization across 9 languages and 8 tasks.
Vocab Diet: Reshaping the Vocabulary of LLMs via Vector Arithmetic: This paper demonstrates that LLMs encode morphological inflections (e.g., walk→walked) as linear directions in embedding space, and proposes a compositional vocabulary design: replacing independently assigned tokens for each surface form with additive combinations of base words and transformation vectors. With the pretrained backbone frozen, only a small adapter module is trained, freeing 10–40% of vocabulary slots for multilingual expansion with negligible impact on downstream performance.
What Factors Affect LLMs and RLLMs in Financial Question Answering?: This paper systematically investigates how prompting methods, agent frameworks, and multilingual alignment approaches affect LLMs and RLLMs (Reasoning Large Language Models) on financial question answering tasks. The key finding is that existing methods essentially improve LLM performance by simulating Long CoT, but offer limited gains for RLLMs that already possess native Long CoT capabilities.
Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?: This paper presents the first systematic investigation into the sources of multilingual reasoning gaps in reasoning language models (RLMs), identifying language understanding failure as the primary cause, and proposes Selective Translation—applied only upon detected understanding failure—as an efficient mitigation strategy.
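
The Vocab Diet entry above treats inflections such as walk→walked as linear directions in embedding space. The sketch below shows the additive composition on toy two-dimensional vectors. Estimating the direction as a mean difference over word pairs is an assumption made for illustration, and all names and numbers are hypothetical.

```python
def estimate_transform(pairs):
    """Average the (inflected - base) difference over example pairs.
    Mean difference is an illustrative estimator, not the paper's method."""
    dim = len(pairs[0][0])
    sums = [0.0] * dim
    for base_vec, inflected_vec in pairs:
        for i in range(dim):
            sums[i] += inflected_vec[i] - base_vec[i]
    return [s / len(pairs) for s in sums]


def compose(base_vec, transform_vec):
    """Represent a surface form as base embedding plus transformation vector."""
    return [b + t for b, t in zip(base_vec, transform_vec)]


# Toy embeddings: (base, past-tense) pairs.
walk, walked = [1.0, 0.0], [1.0, 0.5]
jump, jumped = [0.0, 1.0], [0.0, 1.5]

past = estimate_transform([(walk, walked), (jump, jumped)])
talk = [2.0, 2.0]
talked = compose(talk, past)
```

Under this scheme only base words and a small set of transformation vectors need their own entries, which is what frees vocabulary slots for other languages.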

💻 Code Intelligence¶

Across Programming Language Silos: A Study on Cross-Lingual Retrieval-Augmented Code Generation: This paper presents the first systematic study of cross-programming-language retrieval-augmented code generation (RACG), constructing a 14K-instance dataset spanning 13 programming languages, and reveals the asymmetry of cross-lingual knowledge transfer and its relationship to language family relatedness and pretraining data diversity.
CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment: This paper proposes CodeRL+, which integrates execution semantics alignment into the RLVR training pipeline. By training the model to infer variable-level execution traces, CodeRL+ bridges the gap between code text representations and execution semantics, achieving an average pass@1 improvement of 4.6% on code generation, 15.5% on code reasoning, and 4.4% on test output generation benchmarks.
CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases: This paper presents CodeWiki, an open-source framework based on hierarchical decomposition and recursive multi-agent processing for automated repository-level code documentation generation. It also introduces the CodeWikiBench benchmark, achieving a quality score of 68.79% across seven programming languages, surpassing the closed-source system DeepWiki (64.06%).
CollabCoder: Plan-Code Co-Evolution via Collaborative Decision-Making for Efficient Code Generation: This paper proposes CollabCoder, a plan-code co-evolution framework that employs a Collaborative Decision Module (CDM) to determine whether errors should be repaired at the plan level or the code level, and a Reasoning Trajectory module (RT) to enable self-improving debugging that learns from failures. CollabCoder outperforms strong baselines by 11–20% on challenging programming benchmarks while reducing API calls by 4–10.
DeepGuard: Secure Code Generation via Multi-Layer Semantic Aggregation: DeepGuard is proposed to overcome the "final-layer bottleneck" by aggregating representations from multiple upper Transformer layers via an attention mechanism. Combined with multi-objective training and a lightweight inference-time safety guidance strategy, it achieves an average improvement of 11.9% in secure-and-correct generation rate across 5 code LLMs.
EET: Experience-Driven Early Termination for Cost-Efficient Software Engineering Agents: This paper proposes EET, an experience-driven early termination method that identifies unproductive iterations during patch generation and patch selection phases, reducing the total cost of SE agents by 19%–55% (32% on average) with negligible performance degradation (at most 0.2%).
From Charts to Code: A Hierarchical Benchmark for Multimodal Models: This paper proposes Chart2Code, a hierarchical benchmark comprising 2,186 tasks spanning 22 chart types, organized into three progressively challenging levels: chart reproduction (Level 1), chart editing (Level 2), and long-table-to-chart generation (Level 3). The benchmark evaluates 29 state-of-the-art multimodal models and reveals that even the strongest model, GPT-5.2, achieves a chart quality score of only 33.41 on editing tasks, exposing significant deficiencies in current models for practical chart code generation.
From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation: This work demonstrates that existing bias evaluations of LLM code generation severely underestimate real-world risk: in ML pipeline generation, sensitive attributes appear in 87.7% of feature selection decisions (vs. 59.2% in conditional statements), and models correctly exclude irrelevant features yet consistently retain sensitive attributes such as race and gender, revealing systematic implicit discrimination.
LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software: This paper presents LogicEval, the first systematic evaluation framework for logical vulnerability repair, along with the LogicDS dataset (61 real-world logical vulnerabilities + 61 synthetic Java samples). It systematically evaluates both traditional AVR tools and LLMs on logical vulnerability repair, finding that LLMs perform best when provided with auxiliary information yet overall repair rates remain low (only 5 out of 61 real-world samples correctly repaired). Key bottlenecks identified include prompt sensitivity, context loss, and patch localization difficulty.
MARS2: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation: MARS2 proposes a multi-agent reinforcement tree search framework that embeds multiple independently optimized policies into a shared search tree for collaborative exploration. Through Thompson sampling for agent–node pair selection, tree-consistent reward shaping, and path-level group advantage estimation, the framework consistently improves single-model Pass@1 by up to 8.0% and system-level Pass@1 (MCTS) by up to 6.5% on code generation benchmarks.
OmniDiagram: Advancing Unified Diagram Code Generation via Visual Interrogation Reward: This paper proposes OmniDiagram, a unified diagram code generation framework covering three languages (LaTeX/Mermaid/PlantUML) and three tasks (diagram-to-code, diagram editing, text-to-code). It introduces the Viva (Visual Interrogation Verifies All) reward mechanism based on visual question answering to guide RL training, achieving state-of-the-art performance on multiple benchmarks.
Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?: This paper reveals the "regeneration" tendency of frontier LLMs on debugging tasks. By introducing the PDB framework along with edit-level precision and bug-level recall metrics, the authors find that models such as GPT-5.1-Codex pass over 76% of unit tests yet achieve edit precision below 45%, and that iterative and agent-based debugging strategies fail to substantially improve precision.
QiMeng-PRepair: Precise Code Repair via Edit-Aware Reward Optimization: This paper identifies the "over-editing" problem in LLM-based code repair—where models tend to rewrite large portions of code rather than precisely localizing and fixing bugs—and proposes the PRepair framework. Through Self-Breaking (diversified bug injection) and Self-Repairing (edit-aware GRPO training), PRepair significantly improves repair precision while maintaining correctness and accelerating speculative decoding inference.
ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization: This paper proposes ReFEree, a reference-free and fine-grained factual consistency evaluation method for real-world code summarization. It defines four categories of inconsistency criteria and evaluates at the sentence-segment level. Combined with a dependency information retrieval mechanism, ReFEree achieves 15–18% improvement in human judgment correlation over the previous state of the art on Python and Java.
River-LLM: Large Language Model Seamless Exit Based on KV Share: This paper proposes River-LLM, a training-free framework that addresses the KV Cache absence problem in Early Exit for decoder-only architectures by constructing a lightweight KV-shared exit channel (Exit River). It leverages state transition similarity to guide exit decisions, achieving 1.71×–2.16× real wall-clock inference speedup with near-lossless generation quality.
Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Understanding: This paper distinguishes between lexical recall (verbatim code retrieval) and semantic recall (understanding runtime code semantics), demonstrating that frontier LLMs achieve near-perfect lexical recall yet exhibit severe semantic recall degradation in long contexts. The paper introduces the SemTrace benchmark, revealing that existing evaluations substantially underestimate the extent of semantic understanding failures.
SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization: This paper proposes SOCIA-EVO, an LLM agent framework that reformulates automated simulator construction as a dual-anchored evolutionary process. It anchors empirical constraints via a static Blueprint, decouples structural revision and parameter calibration through bi-level optimization, and manages repair hypotheses via a self-curated strategy Playbook with Bayesian-weighted retrieval guided by execution feedback. SOCIA-EVO significantly outperforms baselines such as Reflexion and G-SIM on three simulation tasks: user modeling, mask-wearing diffusion, and personal mobility.
SolidCoder: Bridging the Mental-Reality Gap in LLM Code Generation through Concrete Execution: SolidCoder transforms code verification from LLM "imagined execution" to "real execution" via the S.O.L.I.D. architecture (Shift-left Planning, Oracle-based Assertions, Live Execution, Intermediate Simulation, Defensive Accumulation), achieving pass@1 scores of 95.7% on HumanEval, 77.0% on CodeContests, and 26.7% on APPS with GPT-4o.
StoryCoder: Narrative Reformulation for Structured Reasoning in LLM Code Generation: This paper proposes StoryCoder, a prompting framework that reformulates code generation problems into coherent natural language narratives. By guiding LLMs through three narrative components—task overview, constraints, and examples—the framework achieves an average zero-shot pass@10 improvement of 18.7% across 11 models.
The Path Not Taken: Duality in Reasoning about Program Execution: This paper introduces the concept of duality in program execution reasoning. Through the DexBench benchmark (445 paired instances), it jointly evaluates LLMs on forward execution reasoning (predicting code coverage under a given input) and backward counterfactual reasoning (inferring input mutations that redirect execution to a target branch). The results reveal that strong performance in a single direction does not transfer to success under joint evaluation, exposing a fundamental deficiency in models' causal understanding of program execution.

⚖️ Alignment & RLHF

Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling: This paper proposes Plan-RewardBench, a trajectory-level preference benchmark targeting complex tool-augmented scenarios, designed to evaluate the ability of reward models to distinguish superior from inferior agent trajectories across multi-step planning, tool usage, and error recovery settings.
Alignment Data Map for Efficient Preference Data Selection and Diagnosis: This paper proposes the Alignment Data Map, an analytical tool that visualizes, selects, and diagnoses preference data by jointly considering response quality and variability. Using only 33% of the data, it achieves alignment performance comparable to full-data training.
Beyond Marginal Distributions: A Framework to Evaluate the Representativeness of Demographic-Aligned LLMs: This paper proposes a representativeness evaluation framework for LLMs that goes beyond marginal distributions by jointly examining marginal response distributions and cross-question correlation structures to assess demographic-aligned models. The findings reveal that while fine-tuning and persona prompting improve the approximation of marginal distributions, neither faithfully reproduces the multivariate correlation patterns observed in human values surveys.
ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training: ConsistRM proposes a consistency-aware self-training framework for generative reward models (GRMs). It introduces two modules — temporal consistency pseudo-labels (integrating online-state and memory-driven preference consistency) and semantic consistency critique rewards (measuring semantic similarity across multiple generated critiques) — achieving an average improvement of 1.5% across five benchmarks without human annotation, while significantly mitigating position bias.
Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries: This paper demonstrates that domain-specific contexts (e.g., chemistry papers) selectively relax LLM safeguards on related harmful knowledge (vertical unlocking), while security research contexts trigger broad relaxation across all harmful categories (general unlocking). Based on these findings, the authors propose the Jargon attack framework, achieving over 93% attack success rate (ASR) on seven frontier models including GPT-5.2 and Claude-4.5.
Reward Modeling for Scientific Writing Evaluation: This paper proposes SciRM and SciRM-Ref, two open-source reward models tailored for scientific writing evaluation. Through two-stage reinforcement learning (GRPO) that separately optimizes evaluation preference and reasoning ability, these models achieve fine-grained multi-aspect evaluation across diverse scientific writing tasks and generalize to unseen evaluation tasks and criteria.
Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors: This paper proposes Fission-GRPO, which dynamically converts tool execution errors into on-policy corrective training instances within the RL training loop. A learned error simulator generates diagnostic feedback, and recovery trajectories are resampled from the augmented context. The approach improves the error recovery rate of Qwen3-8B by 5.7% and raises overall accuracy from 42.75% to 46.75%.
S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models: This paper proposes a Simple-to-Hard (S2H) DPO framework that constructs multi-image preference data across three progressively harder levels (anchored reasoning → cross-image comparison → global visual search), systematically improving VLM multi-image reasoning while preserving single-image performance.
SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging: This paper proposes SafeMERGE, a lightweight post-fine-tuning framework that detects layers deviating from safe behavior via cosine similarity, and selectively merges only those layers with their counterparts from a safety model. Across four LLMs, the method significantly reduces harmful outputs while maintaining or even improving task performance.
SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe: This paper proposes SFTMix, a Mixup-based instruction tuning method that partitions SFT data into high-confidence and low-confidence subsets via training dynamics and applies linear interpolation between the two subsets in the hidden representation space with Mixup regularization. It consistently improves instruction-following ability across LLM families and dataset scales without relying on high-quality curated datasets.
STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming: This paper proposes STAR-Teaming, an automated red teaming framework based on a Strategy-Response Multiplex Network, which models attack strategy selection as a probabilistic optimization of the inverse Ising problem. The framework achieves an average attack success rate (ASR) of 74.5% on HarmBench, outperforming the strongest baseline by 13.5%, while significantly reducing computational overhead.
Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms: This paper identifies the "reward-generation gap" in Direct Alignment Algorithms (DAAs), a mismatch between training objectives and autoregressive decoding dynamics, and proposes POET (Prefix-Oriented Equal-length Training), which truncates preference response pairs to the length of the shorter response to implicitly constrain token-level MDP convergence across all timesteps, achieving an improvement of up to 11.8 percentage points on AlpacaEval 2.
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense: This paper proposes TrajGuard, a training-free decoding-time jailbreak defense framework that quantifies risk in real time by aggregating hidden-state trajectories from key layers within a sliding window, and triggers a lightweight semantic judge only when risk persistently exceeds a threshold. TrajGuard achieves an average defense rate of 95% across 12 jailbreak attacks, with a detection latency of only 5.2 ms/token and a false positive rate below 1.5%.
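
The truncation step described in the POET entry above can be illustrated with a minimal sketch. It assumes the chosen and rejected responses are already tokenized into lists of token IDs; the function name `truncate_pair` and the sample IDs are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def truncate_pair(chosen: List[int], rejected: List[int]) -> Tuple[List[int], List[int]]:
    """Cut both responses of a preference pair to the length of the shorter one."""
    n = min(len(chosen), len(rejected))
    return chosen[:n], rejected[:n]


# Illustrative token IDs only: the chosen response is longer than the rejected one.
chosen, rejected = truncate_pair([11, 12, 13, 14, 15], [21, 22, 23])
print(chosen, rejected)  # [11, 12, 13] [21, 22, 23]
```

After truncation both sequences have equal length, so a preference loss computed over them compares the same number of token positions for each response.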

🎁 Recommender Systems

Beyond Itinerary Planning: A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks: This paper proposes TravelBench, the first travel planning benchmark that integrates real user queries, implicit user preferences, multi-turn interaction, unsolvable task recognition, and 10 real-world tools. It enables reproducible evaluation through a sandbox environment and reveals that state-of-the-art models exhibit uneven performance across different capability dimensions.
Content Fuzzing for Escaping Information Cocoons on Social Media: This paper proposes ContentFuzz, a confidence-guided fuzzing framework from the content creator's perspective. It leverages LLMs to rewrite posts such that the machine-inferred stance label changes while the human-interpreted meaning remains unchanged, thereby breaking information cocoons on social media.
Decisive: Guiding User Decisions with Optimal Preference Elicitation from Unstructured Documents: This paper proposes DECISIVE, an interactive decision-making framework that extracts an objective option scoring matrix from unstructured documents and combines it with Bayesian preference inference to adaptively select pairwise comparison questions, efficiently learning users' latent preference vectors. The system minimizes user interaction burden while delivering transparent, personalized recommendations, achieving up to 20% higher decision accuracy over strong baselines.
From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents: This paper proposes the Memora benchmark and the FAMA metric, extending long-term memory evaluation beyond shallow fact retrieval to memory consolidation and mutation handling spanning weeks to months, revealing systematic failures of existing LLMs and memory agents under frequent knowledge updates.
HARPO: Hierarchical Agentic Reasoning for User-Aligned Conversational Recommendation: This paper proposes HARPO, a framework that reformulates conversational recommendation as a structured decision-making problem explicitly optimizing for recommendation quality. HARPO integrates four components—hierarchical preference learning, value-network-guided tree search reasoning, virtual tool operation abstraction, and multi-agent refinement—achieving significant improvements over existing methods on three benchmarks: ReDial, INSPIRED, and MUSE.
HORIZON: A Benchmark for in-the-wild User Behaviour Modeling: This paper presents HORIZON, the first fully open-source large-scale cross-domain long-term recommendation benchmark. Built by merging all categories of Amazon Reviews into a unified interaction history covering 54M users and 35M items, HORIZON introduces a four-quadrant evaluation protocol that orthogonally decouples the temporal and user axes. The benchmark reveals that models such as BERT4Rec perform strongly in-distribution but degrade significantly under temporal extrapolation and unseen-user settings, and that LLMs do not consistently outperform dedicated architectures for user behaviour modeling.
IceBreaker for Conversational Agents: Breaking the First-Message Barrier with Personalized Starters: This paper proposes IceBreaker, a two-step "handshake" framework—resonance-aware interest distillation to capture trigger interests, followed by interaction-oriented starter generation with personalized preference alignment—to address the "first-message barrier" in conversational agents. A/B testing on one of the world's largest conversational products yields +1.84‰ active days and +94.25‰ CTR.
Learning to Retrieve User History and Generate User Profiles for Personalized Persuasiveness Prediction: This paper proposes ReCAP, a framework featuring a trainable query generator and a user profile generator that retrieves persuasion-relevant information from user history and constructs context-aware user profiles, significantly improving personalized persuasiveness prediction.
Personalized Benchmarking: Evaluating LLMs by Individual Preferences: This paper analyzes personalized rankings for 115 active Chatbot Arena users and finds that the average Spearman correlation between Bradley-Terry personalized rankings and the global ranking is only \(\rho=0.04\) (with 57% of users exhibiting near-zero or negative correlation), demonstrating that aggregated benchmarks fail to reflect individual user preferences. Topic and style features are shown to successfully predict user-specific model rankings.
Scripts Through Time: A Survey of the Evolving Role of Transliteration in NLP: This paper presents a systematic survey of the evolving role of transliteration in cross-lingual NLP. It proposes a five-category motivation taxonomy (named entity/OOV handling, code-mixing, cross-script similarity exploitation, English-centric transfer, and unified preprocessing), compares six integration strategies, and discusses whether transliteration remains necessary in the era of modern LLMs.
What Makes an Ideal Quote? Recommending "Unexpected yet Rational" Quotations via Novelty: NOVELQR proposes a novelty-driven quote recommendation framework that constructs a deep semantic knowledge base via a generative label agent to enable semantically rational retrieval, and employs a token-level novelty estimator to mitigate autoregressive continuation bias, achieving significant improvements on a bilingual benchmark.
What Makes LLMs Effective Sequential Recommenders? A Study on Preference Intensity and Temporal Context: This paper identifies that binary preference modeling in existing LLM-based recommender systems discards two critical signals—preference intensity and temporal context—and proposes RecPO, a framework that incorporates both factors into preference optimization via adaptive reward margins, substantially outperforming S-DPO and other baselines across five datasets.
Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation: SiPeR addresses the challenge of dynamically shifting and implicitly expressed user preferences in situated conversational recommendation (SCR) via two mechanisms — Scene Transition Estimation ("Where") and Bayesian Inverse Inference ("What") — achieving improvements of 10.9% and 10.6% on SIMMC 2.1 and SCREEN, respectively.
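
The Personalized Benchmarking entry above compares each user's model ranking with the global ranking through Spearman correlation. A minimal sketch of that comparison, assuming rankings without ties and using the textbook formula \(\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}\); the ranks below are made up for illustration and do not come from the paper.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    # Sum of squared rank differences per item.
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical ranks of five models: global leaderboard vs. one user's ranking.
global_rank = [1, 2, 3, 4, 5]
user_rank = [3, 5, 1, 4, 2]
print(round(spearman_rho(global_rank, user_rank), 2))
```

A value near 1 means the user's ranking agrees with the global one, a value near 0 means the two are essentially unrelated, and a negative value means they tend to disagree.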

🎨 Image Generation

AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce: This paper proposes the AFMRL framework, which formulates fine-grained product understanding in e-commerce as an attribute generation task. An MLLM generates key attributes to enhance contrastive learning (AGCL), while retrieval performance serves as a reward signal to inversely optimize the attribute generator (RAR), achieving state-of-the-art retrieval performance on large-scale e-commerce datasets.
BookAgent: Orchestrating Safety-Aware Visual Narratives via Multi-Agent Cognitive Calibration: BookAgent is a safety-aware multi-agent framework that generates high-quality, character-consistent, and content-safe picture books end-to-end from user drafts through a three-stage closed-loop architecture: Value-Aligned Storyboard (VAS) + Iterative Cross-Modal Refinement (ICR) + Temporal Cognitive Calibration (TCC).
CoDial: Interpretable Task-Oriented Dialogue Systems Through Dialogue Flow Alignment: This paper proposes CoDial, a framework that converts predefined dialogue flows (task schemas) into structured heterogeneous graphs and automatically generates LLM guardrail code (e.g., Colang), achieving interpretable and controllable task-oriented dialogue policies at inference time. It reaches SOTA on the STAR benchmark without requiring training data.
ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling: This paper proposes ControlAudio, a unified progressive diffusion modeling framework that combines three capabilities (text-guided generation, precise temporal control, and intelligible speech synthesis) within a single diffusion model. It relies on three-stage progressive training (TTA pretraining → temporal control fine-tuning → joint temporal+intelligible speech training) and progressive guidance sampling, and significantly outperforms existing methods in temporal precision and speech intelligibility.
Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models: This paper systematically investigates token-level information distribution in text encoder outputs of text-to-image models through a causal intervention framework, discovering that lexical item semantics are typically concentrated in 1-2 representative tokens, and that cross-item information flow leads to semantic leakage and image misinterpretation in 11% of cases. The paper proposes simple yet effective token-level intervention methods to improve alignment.
From Past To Path: Masked History Learning for Next-Item Prediction in Generative Recommendation: This paper proposes Masked History Learning (MHL), a training framework that introduces masked history reconstruction as an auxiliary task alongside autoregressive training in generative recommendation. By combining entropy-guided adaptive masking strategies and curriculum learning schedulers, the model shifts from merely predicting "what's next" to understanding "why this path formed," significantly outperforming SOTA on three datasets.
Investigating Counterfactual Unfairness in LLMs towards Identities through Humor: This paper systematically investigates counterfactual unfairness in LLMs through humor scenarios—observing behavioral changes after swapping speaker/listener identities. Results reveal that jokes told by privileged-group speakers are refused at a rate as high as 67.5%, are judged as malicious with 64.7% higher probability, and receive social harm scores up to 1.5 points (on a 5-point scale), demonstrating that models have internalized fixed social privilege hierarchies rather than performing genuine social reasoning.
Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions: This paper presents the first large-scale systematic audit of the native sampling capability of 11 frontier LLMs across 15 probability distributions, demonstrating that LLMs severely lack intrinsic probabilistic sampling mechanisms and that this deficiency propagates into downstream applications as systematic bias.
MASH: Evading Black-Box AI-Generated Text Detectors via Style Humanization: This paper proposes MASH (Multi-stage Style Humanization), a three-stage pipeline consisting of style-injection SFT → DPO alignment → inference-time refinement, which trains a rewriter with only 0.1B parameters to evade AI-generated text detectors in a black-box setting with an average attack success rate of 92%, while maintaining high linguistic quality.
VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval: This paper proposes Visualize-then-Retrieve (VisRet), a novel retrieval paradigm that first visualizes a text query into images via a T2I generative model and then performs retrieval within the image modality. VisRet achieves an average nDCG@30 improvement of 0.125 (CLIP) and 0.121 (E5-V) across four benchmarks, and improves downstream VQA accuracy by 15.7% on Visual-RAG-ME.
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching: This paper proposes ZipVoice-Dialog, the first non-autoregressive zero-shot spoken dialogue generation model based on flow matching. Through two simple designs—curriculum learning and speaker-turn embeddings—it addresses the unintelligible speech and turn confusion problems that arise when flow matching is directly applied to dialogue scenarios. The paper also releases OpenDialog (6.8k hours), the first large-scale open-source spoken dialogue dataset.

📹 Video Understanding

ArrowGEV: Grounding Events in Video via Learning the Arrow of Time: This paper proposes ArrowGEV, a reinforcement learning framework inspired by the physics concept of "arrow of time," which models temporal directionality in videos by distinguishing between temporally sensitive and insensitive events, improving VLM event localization accuracy and temporal understanding capabilities.
Distorted or Fabricated? A Survey on Hallucination in Video LLMs: This paper presents the first systematic taxonomy of hallucination phenomena in Video Large Language Models (Vid-LLMs), proposing a mechanism-driven classification framework that distinguishes between "dynamic distortion" (spatiotemporal relationship and reference consistency errors) and "content fabrication" (driven by statistical priors and audio-visual conflicts), while surveying evaluation benchmarks, mitigation strategies, and root cause analysis.
GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents: This paper presents GameplayQA, an end-to-end benchmarking framework built on multi-player 3D game videos. Through dense timeline annotation (1.22 labels/second) and a structured distractor taxonomy, it systematically evaluates multimodal large language models (MLLMs) on perception and reasoning in decision-dense, multi-view synchronized scenarios, revealing a substantial gap between frontier models and human performance.
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding: This paper proposes HERMES, a training-free framework for efficient streaming video understanding grounded in a mechanistic analysis of layer-wise attention preferences in MLLMs. KV caches are conceptualized as a hierarchical memory system, with shallow layers as sensory memory, middle layers as working memory, and deep layers as long-term memory. This enables real-time streaming video QA with a 68% reduction in video tokens while maintaining or improving accuracy, and a TTFT latency below 30 ms, which is 10× faster than the previous SOTA.
Preference Estimation via Opponent Modeling in Multi-Agent Negotiation: This paper proposes a preference estimation method that integrates LLM-extracted natural language preference signals into a Bayesian opponent modeling framework. In multi-party, multi-issue negotiations, it fuses qualitative cues and quantitative bid information via a linguistic likelihood function, improving the full agreement rate (FAR) from 37% to 62%.
Probing for Reading Times: This paper probes the ability of representations from individual layers of language models to predict reading times, finding that early-layer representations outperform surprisal on early fixation measures, while surprisal performs better on late measures, and that the best predictor varies by language and metric.
RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora: This paper proposes the RARE framework, which tracks cross-document redundancy by decomposing documents into atomic facts and introduces CRRF (Criterion-separated Reciprocal Rank Fusion) to stabilize multi-criteria LLM judgments. The framework constructs the RedQA benchmark over high-redundancy enterprise corpora in finance, legal, and patent domains, revealing that mainstream retrievers suffer a dramatic collapse in PerfRecall@10 from 66.4% to 5.0–27.9% under 4-hop high-overlap settings.
Saber: Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for DLMs: This paper proposes Saber, a training-free sampling algorithm for diffusion language models (DLMs) that achieves an average Pass@1 improvement of 1.9% on code generation while delivering 251.4% inference speedup. This is accomplished through two strategies: adaptive acceleration (dynamically adjusting the amount of parallel decoding based on established context) and backtracking-enhanced remasking (revoking tokens falsified by newly established context).
VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analysis: This paper proposes VC-Inspector, a reference-free video caption evaluation metric based on open-source lightweight multimodal models (Qwen2.5-VL 3B/7B), which generates training data through a controllable factual error synthesis pipeline and achieves \(\tau_b\)=42.58 human judgment correlation on VATEX-Eval, outperforming GPT-4o-based G-VEval (\(\tau_b\)=39.40), while reaching 99.6% accuracy on hallucination detection benchmarks.
ViLL-E: Video LLM Embeddings for Retrieval: This paper proposes ViLL-E, the first unified Video LLM architecture supporting both text generation and embedding generation. Through a three-stage joint generative-contrastive training strategy and an adaptive KV-Former embedding head, ViLL-E approaches expert models on video retrieval and temporal grounding while maintaining competitive performance on VideoQA.
VISTA: Verification In Sequential Turn-based Assessment: VISTA proposes a multi-turn dialogue factuality assessment framework based on claim-level decomposition and sequential consistency tracking, subdividing unverifiable content into four categories: subjective, contradicted, lacking evidence, and abstention. It significantly outperforms FActScore and LLM-as-Judge baselines across four dialogue benchmarks and eight LLMs.
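
The RARE entry above mentions CRRF, a criterion-separated variant of reciprocal rank fusion. The note does not spell out how CRRF separates criteria, so the sketch below shows only plain reciprocal rank fusion, the standard technique it builds on. The function name, the smoothing constant `k=60` (a common default), and the document IDs are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Three hypothetical judgment lists over the same three documents.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]])
print(fused)  # ['b', 'a', 'c']
```

Because each list contributes only a rank-based score, no single list with extreme raw scores can dominate the fused ordering, which is why rank fusion is often used to stabilize multiple judgments.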

🗣️ Dialogue Systems

Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning: This paper proposes ConvAgent, which trains a conversational search agent to alternate between retrieval and reasoning across multi-turn interactions by decomposing the RL training reward into three complementary components: outcome reward, information gain reward, and mixed-initiative action reward.
APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI: APEX-MEM proposes a conversational memory system based on a Property Graph, append-only event storage, and a multi-tool retrieval agent. Through a domain-agnostic ontology and retrieval-time temporal reasoning, it achieves 88.88% and 86.2% accuracy on LOCOMO and LongMemEval respectively, significantly outperforming existing structured memory approaches.
Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review: This paper reframes academic rebuttal generation as an "author-in-the-loop" task, contributing the Re3Align dataset (3.4K papers, 440K sentence-level edit annotations, 15K review–rebuttal–revision triples), the REspGen controllable generation framework, and the REspEval evaluation suite comprising 20+ metrics. The framework is systematically validated across 5 state-of-the-art LLMs, demonstrating the effectiveness of author input, controllability, and evaluation-guided refinement.
Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky: This paper proposes DiaFORGE, a disambiguation-centric synthetic data generation pipeline combined with chain-of-thought fine-tuning and a dynamic evaluation framework, enabling open-source LLMs to achieve tool-calling success rates 27 percentage points higher than GPT-4o and 49 percentage points higher than Claude-3.5-Sonnet when facing near-duplicate enterprise APIs.
Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation: This paper proposes DRCR, the first framework to introduce context rewriting into multi-party dialogue generation, using dual feedback signals of discourse coherence and response quality to construct preference data, and enabling the rewriter and responder to mutually enhance each other through iterative training via dynamic self-evolution.
ETHICMIND: A Risk-Aware Framework for Ethical-Emotional Alignment in Multi-Turn Dialogue: ETHICMIND proposes an inference-time risk-aware alignment framework that jointly analyzes ethical risks and user emotions at each turn of multi-turn dialogue, plans high-level response strategies, and generates replies that balance ethical guidance and emotional resonance, achieving more consistent alignment performance in high-risk and morally ambiguous scenarios without requiring additional training.
SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation: This paper proposes SPASM, a stability-centric persona-driven multi-turn dialogue simulation framework that significantly reduces role drift and echo effects in LLM-LLM conversations through three components: modular persona generation, Egocentric Context Projection (ECP), and termination detection, constructing 45,000 high-quality multi-turn dialogue instances.
Template-assisted Contrastive Learning of Task-oriented Dialogue Sentence Embeddings: This paper proposes TaDSE, a framework that leverages existing template information in dialogues as auxiliary anchors. Through three stages—template-aware data augmentation, paired contrastive training, and semantic compression inference—TaDSE significantly improves sentence embedding quality for task-oriented dialogue in an unsupervised setting, surpassing previous SOTA and even outperforming supervised commercial embedding models on five benchmarks.
Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation: This paper proposes ProChatIP, a framework that transforms customer service chatbots from passive response tools into proactive information harvesting engines. A dedicated dialogue policy module learns when to probe users for preset target information while minimizing conversation turns and user friction.
VoxMind: An End-to-End Agentic Spoken Dialogue System: This paper proposes VoxMind, a unified framework that endows end-to-end spoken dialogue models with agentic capabilities: explicit reasoning through a "Think-before-Speak" mechanism, combined with a multi-agent dynamic tool management architecture that decouples reasoning latency from tool scale, improving task completion rate from a baseline of 34.88% to 74.57% and surpassing Gemini-2.5-Pro.

🔎 AIGC Detection¶

Beyond the Final Actor: Modeling the Dual Roles of Creator and Editor for Fine-Grained LLM-Generated Text Detection: This paper proposes RACE (Rhetorical Analysis for Creator-Editor Modeling), which leverages Rhetorical Structure Theory (RST) to construct logic graphs that model the "creator's" cognitive architecture, while extracting discourse unit-level features to capture the "editor's" linguistic style, achieving fine-grained four-class LLM-generated text detection (human-written/LLM-written/LLM-polished human text/human-rewritten LLM text).
BIASEDTALES-ML: A Multilingual Dataset for Analyzing Narrative Attribute Distributions in LLM-Generated Stories: BiasedTales-ML constructs a corpus of ~350K LLM-generated children's stories across 8 languages, using full-permutation prompt design and a distributional analysis framework to reveal that social attribute distributions in narratives vary significantly across languages, and English-centric evaluation fails to capture bias patterns in multilingual settings.
CiteGuard: Faithful Citation Attribution for LLMs via Retrieval-Augmented Validation: CiteGuard proposes a retrieval-augmented agent framework with extended retrieval actions (including full-text search and context retrieval) to provide a more faithful basis for scientific citation attribution, achieving 68.1% accuracy on the CiteME benchmark — a 10-point improvement over baselines, approaching human performance (69.2%).
FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation: FlexGuard outputs continuous risk scores (0-100) instead of binary safe/unsafe judgments for LLM content moderation, achieving SOTA robustness and accuracy across varying strictness deployment scenarios through rubric-guided distillation and GRPO risk alignment training.
Frankentext: Stitching Random Text Fragments into Long-Form Narratives: This paper proposes Frankentext, a paradigm where LLMs stitch random human text fragments into coherent long-form narratives under extreme constraints (90% content verbatim-copied from human writing), revealing severe failures of existing AI text detectors in mixed-authorship scenarios (72% of Frankentext is misclassified as human-written).
Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs: This paper proposes a reasoning-based cluster refinement framework that uses LLMs as semantic judges (rather than embedding generators) to verify and restructure unsupervised clustering outputs through coherence verification, redundancy adjudication, and label grounding, significantly improving cluster consistency and human-aligned annotation quality on social media corpora.
Temporal Flattening in LLM-Generated Text: Comparing Human and LLM Writing Trajectories: This paper constructs a longitudinal writing dataset spanning 12 years and discovers "temporal flattening" in LLM-generated text—while lexical diversity is high, temporal drift in semantic and cognitive-emotional dimensions is significantly lower than human writing, achieving 94% accuracy in distinguishing human vs. LLM text using temporal variation patterns alone.
When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection: This paper reveals the "feature-inversion trap" in MGT detectors under personalization—features that distinguish human-written and machine-generated text in the general domain get inverted in the personalized domain, causing detector performance to plummet or even flip. The proposed StyloCheck framework predicts cross-domain performance changes by quantifying detector reliance on inverted features, achieving prediction correlation above 0.85.
Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry: This paper constructs the first benchmark for detecting LLM-generated classical Chinese poetry, ChangAn (30,664 poems), systematically evaluating 12 AI detection methods across different text granularities and generation strategies, revealing severe limitations of current Chinese text detectors in the classical poetry domain.

👥 Social Computing¶

Among Us: Language of Conspiracy Theorists on Mainstream Reddit: Analyzing 500 million Reddit comments over 10 years of longitudinal data, this study finds that users active in conspiracy theory communities exhibit detectable unique language patterns in mainstream communities (average 87% classification accuracy), but these patterns are highly context-dependent, with community-specific models outperforming global models by up to 17 percentage points.
Explain the Flag: Contextualizing Hate Speech Beyond Censorship: This paper proposes a hybrid approach combining LLMs with human-curated lexicons in three languages (English/French/Greek) to detect and explain hate speech—the term-based pipeline uses lexicon matching + LLM semantic disambiguation to detect inherently derogatory terms, the term-free pipeline uses LLMs to detect group-targeted content, and both are fused to generate evidence-based explanations.
How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects: Through representational analysis, this work reveals that the concepts of "logical validity" and "plausibility" are highly aligned in LLM hidden layer spaces, causing models to conflate plausibility with validity (content effects). The paper constructs debiasing steering vectors that effectively decouple these two concepts, reducing content effects while improving reasoning accuracy.
Is this chart lying to me? Automating the detection of misleading visualizations: This paper proposes the Misviz (2,604 real-world misleading visualizations) and Misviz-synth (57,665 synthetic visualizations) benchmarks covering 12 misleading types, and systematically evaluates MLLMs, rule-based checkers, and image classifiers for misleading chart detection, revealing that the task remains highly challenging.
On the Step Length Confounding in LLM Reasoning Data Selection: This paper discovers that naturalness-based LLM reasoning data selection methods suffer from "step length confounding"—systematically preferring samples with longer per-step tokens rather than higher-quality ones. The root cause is that the low probability of reasoning steps' first tokens gets diluted by long steps. Two correction methods are proposed: Aslec-drop (dropping first-token probabilities) and Aslec-casl (causal regression debiasing), improving average accuracy by 6–9%.
Persona-E2: A Human-Grounded Dataset for Personality-Shaped Emotional Responses to Textual Events: This paper constructs Persona-E2, the first large-scale dataset linking personality traits (MBTI + Big Five) with reader emotional responses, containing 3,111 events × 36 annotators for a total of 112K annotations. It reveals that LLMs suffer from "personality illusion" when simulating personality-shaped emotional responses, and that Big Five features mitigate this more effectively than MBTI.
SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models: This paper proposes SPAGBias, a framework that systematically evaluates gender bias in LLMs within urban micro-spatial contexts through three diagnostic layers—explicit, probabilistic, and constructive bias—revealing structured spatial-gender association patterns and tracing how bias is embedded and amplified throughout the model development pipeline.
ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection: ToxiTrace proposes an explainable Chinese toxicity detection method for BERT-class encoders, combining CuSA (LLM-guided weak annotation), GCLoss (gradient-constrained loss), and ARCL (adversarial reasoning contrastive learning) to achieve both high sentence-level classification accuracy and contiguous toxic span extraction while maintaining efficient encoder inference.
ToxReason: A Benchmark for Mechanistic Chemical Toxicity Reasoning via Adverse Outcome Pathway: ToxReason proposes an AOP-based chemical toxicity mechanistic reasoning benchmark that integrates drug-target experimental data with toxicity labels, requiring models to reason from molecular initiating events to organ-level adverse outcomes; a 4B model trained with GRPO reinforcement learning surpasses GPT-5 and other large models in both toxicity prediction (F1 71.4%) and reasoning quality.
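
The step length confounding described in the "On the Step Length Confounding in LLM Reasoning Data Selection" entry above can be illustrated with a small arithmetic sketch. It assumes that naturalness is scored as the mean token log-probability of a step, which is an assumption made here for illustration. The token probabilities are invented, and dropping the first token only mirrors the idea stated for Aslec-drop, not the paper's implementation.

```python
import math

def mean_logprob(token_probs, drop_first=False):
    """Mean token log-probability of one reasoning step (assumed scoring rule)."""
    probs = token_probs[1:] if drop_first else token_probs
    return sum(math.log(p) for p in probs) / len(probs)

# Hypothetical steps: both open with the same low-probability first token,
# and every later token has the same probability. Only the length differs.
short_step = [0.05] + [0.9] * 4
long_step = [0.05] + [0.9] * 19

# The long step scores higher only because its first token is averaged
# over more tokens, not because its content is more likely.
short_score = mean_logprob(short_step)
long_score = mean_logprob(long_step)

# Dropping the first token removes the length advantage in this toy case.
short_dropped = mean_logprob(short_step, drop_first=True)
long_dropped = mean_logprob(long_step, drop_first=True)

print(round(short_score, 3), round(long_score, 3))
print(round(short_dropped, 3), round(long_dropped, 3))
```

In this toy setting the two steps are identical in per-token quality after the first token, so any score gap comes from length alone.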

🔗 Causal Inference¶

Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size: This paper establishes the first scaling laws for "contextual entrainment," discovering that larger models better resist misinformation in semantic contexts (negative exponent) but more readily copy irrelevant tokens in non-semantic contexts (positive exponent), revealing opposing scaling behaviors of semantic filtering and mechanical copying functions.
CausalDetox: Causal Head Selection and Intervention for Language Model Detoxification: CausalDetox uses Probability of Necessity and Sufficiency (PNS) as a causal criterion to precisely locate attention heads causally responsible for toxic content, applying local inference-time intervention and PNS-guided fine-tuning for detoxification, achieving up to 5.34% toxicity reduction while preserving language fluency.
ClimateCause: Complex and Implicit Causal Structures in Climate Reports: ClimateCause constructs the first expert-annotated dataset for complex and implicit causal structures in climate reports (874 causal relations), supporting nested causality, multi-event decomposition, correlation direction, and spatiotemporal context annotation. LLM benchmarking shows causal chain reasoning remains a major challenge.
Cross-Modal Taxonomic Generalization in (Vision-) Language Models: This paper systematically studies whether LMs in VLMs can cross-modally generalize purely text-learned taxonomic knowledge (hypernym relations) to visual inputs, finding that even without any visual-language hypernym supervision, pretrained LMs can identify hypernym categories in images, but this generalization requires visual coherence among category members.
Dialectic-Med: Mitigating Diagnostic Hallucinations via Counterfactual Adversarial Multi-Agent Debate: Dialectic-Med, inspired by Popperian falsificationism, uses three-agent adversarial dialectical reasoning (proposer for diagnostic hypotheses, opponent with visual falsification module for proactively retrieving contradictory visual evidence, and mediator with weighted consensus graph), achieving SOTA on MIMIC-CXR-VQA, VQA-RAD, and PathVQA with 12.5% explanation faithfulness improvement.
Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies: Through 2000 LLM simulations and a 290-person user study in a dual-framework experiment, this paper compares the impacts of human personality traits and AI design attributes in imperfectly cooperative scenarios (hiring negotiation, partially honest trading), finding that personality traits dominate in simulations while AI transparency is the key driver in real user experiments.
iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations: iTAG generates text with simultaneously high causal graph annotation accuracy (F1≥0.95) and naturalness (near-random detection rate) through a three-phase inverse design pipeline (parameterized causal graph construction → CoT-based concept assignment → structure-preserving text generation), serving as a practical substitute for real annotated data for benchmarking text causal discovery algorithms.
Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation: This paper systematically studies LLM multilingual counterfactual generation across six languages (English, Arabic, German, Spanish, Hindi, Swahili), comparing direct generation and translation paths. Translation paths yield higher label flip rates but require more edits, four common error patterns are identified, and multilingual counterfactual data augmentation outperforms cross-lingual augmentation, especially for low-resource languages.

🕸️ Graph Learning¶

AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning: AgentGL is the first RL-based agentic graph learning (AGL) framework that enables LLM agents to autonomously navigate text-attributed graphs (TAGs) via graph-native search tools, achieving up to 17.5% and 28.4% absolute accuracy gains on node classification and link prediction respectively.
ARK: Answer-Centric Retriever Tuning via KG-augmented Curriculum Learning: ARK filters positive samples through three-dimensional answer sufficiency scoring (Forward + Backward + Retriever alignment) and generates progressively difficult hard negatives via LLM-constructed knowledge graphs for curriculum contrastive learning, averaging +14.5% F1 across 10 datasets.
AutoPKG: An Automated Framework for Dynamic E-commerce Product-Attribute Knowledge Graph Construction: AutoPKG is a multi-agent LLM framework that automatically constructs Product-Attribute knowledge graphs (PKGs) from multimodal e-commerce content, using a Type Induction Agent, an Attribute Key Discovery Agent, an Attribute Value Extraction Agent, and a centralized KGD Decision Agent, achieving 0.953 WKE for types and +7.89% recommendation GMV in online A/B tests on Lazada.
Comparing Human and Large Language Model Interpretation of Implicit Information: This paper proposes the Implicit Information Extraction (IIE) task and a three-stage LLM pipeline (information extraction → reasoning verification → temporal analysis), building structured knowledge graphs to represent implicit textual meaning. Crowdsourced human comparisons reveal LLMs are more conservative in socially-rich contexts but humans are more conservative in short factual contexts.
From Nodes to Narratives: Explaining Graph Neural Networks with LLMs and Graph Context: Gspell is a lightweight post-hoc explanation framework that projects GNN node embeddings into LLM embedding space and constructs hybrid prompts (soft prompts + text), enabling LLMs to directly reason over GNN internal representations and generate natural language explanations with explanation subgraphs, achieving a good balance of faithfulness and interpretability on text-attributed graphs.
Graph-Based Alternatives to LLMs for Human Simulation: GEMS models closed-form human behavior simulation as link prediction on heterogeneous graphs with three node types (subgroups, individuals, choices) and two bidirectional relations, matching or surpassing strong LLM baselines across three datasets and three evaluation settings while using 1000x fewer parameters.
LLMs Underperform Graph-Based Parsers on Supervised Relation Extraction for Complex Graphs: Across six RE datasets comparing four LLMs (7B-70B) against a lightweight graph parser (124M parameters), graph parsers consistently and significantly outperform LLMs when average relation graph edges exceed ~18, with F1 gaps reaching 13.2 points on the most complex ERFGC dataset, revealing fundamental LLM limitations in complex linguistic graph structure extraction.
Which Bird Does Not Have Wings: Negative-Constrained KGQA with Schema-Guided Semantic Matching and Self-Directed Refinement: This paper defines the NEST KGQA task and the NestKGQA dataset for negation-constrained knowledge graph QA, designs PyLF (Python-format logical form) for clear negation expression, and proposes the CUCKOO framework with constraint-aware draft generation, schema-guided semantic matching, and self-directed refinement, achieving efficient and precise answers for multi-constraint questions in few-shot settings.

⚡ LLM Efficiency¶

BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs: BOSCH is a training-free head-level SWA mixing method that models SWA head selection as a large neighborhood search problem decomposed into three stages (layer importance probing → adaptive ratio allocation → grouped head selection), systematically outperforming layer-level heuristics and 6 static head-level methods across 4 models and 4 ratio settings.
HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns: HumanLLM models 244 psychological patterns (100 personality traits + 144 social cognitive patterns) as interacting causal forces rather than isolated labels, constructs 11,359 multi-pattern interaction scenarios, and achieves \(r=0.90\) human alignment through dual-layer checklist evaluation. HumanLLM-8B surpasses Qwen3-32B in multi-pattern dynamics with 4x fewer parameters.
Multi-Drafter Speculative Decoding with Alignment Feedback: MetaSD is a unified framework integrating multiple heterogeneous drafters into speculative decoding, modeling drafter selection as a multi-armed bandit problem with Block Divergence (BD) reward signals to dynamically select the most aligned drafter, consistently outperforming single-drafter methods in both black-box and white-box configurations.
Native Hybrid Attention for Efficient Sequence Modeling: Native Hybrid Attention (NHA) concatenates linear RNN long-term memory slots with sliding window short-term precise tokens and processes them through a single softmax attention, achieving native intra-layer and inter-layer hybridization — dynamically allocating long-short attention weights without extra fusion parameters, outperforming Transformer and other hybrid baselines on recall-intensive and commonsense reasoning tasks.
Planning Beyond Text: Graph-based Reasoning for Complex Narrative Generation: PLOTTER shifts narrative planning from text representation to graph structure (event graph + character graph), diagnosing and repairing narrative flaws through multi-agent Evaluate-Plan-Revise iterative cycles on graph topology, significantly outperforming existing methods on narrativity, characterization, and dramatic tension.
SciCoQA: Quality Assurance for Scientific Paper–Code Alignment: SciCoQA is the first benchmark for detecting semantic discrepancies between scientific papers and their code implementations, containing 635 discrepancy instances (92 real + 543 synthetic). Evaluation of 22 LLMs reveals the strongest model detects only 46.7% of real discrepancies, uncovering a critical capability gap in automated scientific quality assurance.
SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration: SpecBound suppresses shallow-layer false high-confidence predictions via layer-wise temperature annealing and designs a bounded speculation algorithm to adaptively control draft depth and width, achieving up to 2.33x inference acceleration while maintaining lossless output.
Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding: Speculative Verification (SV) introduces a companion model of equal size to the drafter, using draft-companion distribution similarity \(S\) and companion acceptance probability \(A\) to predict target model acceptance probability, dynamically selecting optimal verification length to maximize goodput, achieving average 1.4x and up to 1.9x speedup over standard speculative decoding in large-batch inference.
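
The MetaSD entry above frames drafter selection as a multi-armed bandit problem. The sketch below shows what such a selection loop can look like using the generic UCB1 rule. It is not MetaSD's algorithm: the class name `UCB1Selector` is hypothetical, and the fixed per-drafter rewards are invented stand-ins for a real signal such as the Block Divergence reward the entry mentions.

```python
import math

class UCB1Selector:
    """Generic UCB1 bandit over a fixed set of arms (here: candidate drafters)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # how often each arm was chosen
        self.means = [0.0] * n_arms  # running mean reward per arm

    def select(self):
        # Try every arm once before using the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            m + math.sqrt(2.0 * math.log(total) / c)
            for m, c in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# Hypothetical, noise-free rewards for three drafters (higher = better aligned).
toy_rewards = [0.3, 0.8, 0.5]
selector = UCB1Selector(len(toy_rewards))
for _ in range(200):
    arm = selector.select()
    selector.update(arm, toy_rewards[arm])

most_used = max(range(len(toy_rewards)), key=selector.counts.__getitem__)
print(selector.counts, most_used)
```

The exploration bonus shrinks as an arm is pulled more often, so the loop keeps sampling weaker drafters occasionally while concentrating on the one with the best observed reward.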

🤖 Robotics & Embodied AI¶

Can AI-Generated Persuasion Be Detected? Persuaficial Benchmark and AI vs. Human Linguistic Differences: Persuaficial is a high-quality multilingual benchmark covering six languages for AI-generated persuasive text. Systematic evaluation reveals that subtle AI persuasion is harder to detect than human persuasion (F1 drops ~20%), while intensified persuasion is paradoxically easier to detect.
Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection: RADAR uses role-anchored (politician vs scientist) multi-agent debate to detect half-truths — statements that are factually correct but misleading due to omitted context — with dual-threshold adaptive early stopping, consistently outperforming single-agent and traditional multi-agent baselines under noisy retrieval conditions.
DeCoVec: Building Decoding Space based Task Vector for Large Language Models via In-Context Learning: DeCoVec constructs task vectors in the decoding space (output logits) by contrasting few-shot and zero-shot logit distributions: \(\mathbf{v}_\mathcal{T}^t = \mathbf{z}_{\text{icl}}^t - \mathbf{z}_{\text{zs}}^t\), injecting them into decoding via \(\tilde{\mathbf{z}}^t = \mathbf{z}_{\text{de}}^t + \lambda \cdot \mathbf{v}_\mathcal{T}^t\), achieving up to +5.50 average accuracy improvement over standard few-shot baselines across 7 LLMs without any training.
GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models: GRASPrune proposes a globally budget-constrained structured pruning framework that enforces hard mask budget constraints at every training step via Projected Straight-Through Estimator (Projected STE), jointly pruning FFN channels and KV head groups, achieving 12.18 PPL at 50% parameter retention on LLaMA-2-7B with only 6 minutes of single A100 training.
On Safety Risks in Experience-Driven Self-Evolving Agents: This paper systematically studies safety risks of experience-driven self-evolving agents, finding that even experience accumulated solely from harmless tasks causes significant safety degradation (ASR increases 13-49%). The root cause is the execution-oriented nature of accumulated experience, which reinforces action-taking over refusal behaviors.
Reasoning Hijacking: The Fragility of Reasoning Alignment in Large Language Models: This paper introduces "Reasoning Hijacking," a new attack paradigm that manipulates LLM reasoning logic by injecting false decision criteria into the data channel rather than changing task goals, achieving high attack success rates while bypassing intent-detection-based defenses.
VLN-NF: Feasibility-Aware Vision-and-Language Navigation with False-Premise Instructions: VLN-NF is the first benchmark requiring VLN agents to identify false-premise instructions and output NOT-FOUND in 3D partially observable environments. The paper also proposes the REV-SPL evaluation metric and the ROAM two-stage hybrid framework, achieving 6.1 REV-SPL (+45% over supervised baselines).
XOXO: Stealthy Cross-Origin Context Poisoning Attacks against AI Coding Assistants: This paper reveals a design vulnerability in AI coding assistants' automatic context collection and proposes Cross-Origin Context Poisoning (XOXO) attacks: poisoning shared codebases via semantics-preserving code transformations (e.g., variable renaming) causes assistants like GitHub Copilot to unknowingly generate vulnerable code, achieving 73.20% average ASR across 8 SOTA models.
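
The two formulas in the DeCoVec entry above can be traced with a toy example. The sketch below applies them to a three-token vocabulary with invented logits. The function names are hypothetical, and since the entry does not define the "de" subscript, reading \(\mathbf{z}_{\text{de}}^t\) as the logits of the decoding pass being steered is an assumption.

```python
def task_vector(z_icl, z_zs):
    """v_T^t = z_icl^t - z_zs^t: few-shot logits minus zero-shot logits."""
    return [a - b for a, b in zip(z_icl, z_zs)]

def steer_logits(z_de, v, lam):
    """z_tilde^t = z_de^t + lambda * v_T^t."""
    return [z + lam * x for z, x in zip(z_de, v)]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Invented logits over a 3-token vocabulary at one decoding step.
z_icl = [2.0, 0.5, 1.0]  # with few-shot demonstrations
z_zs = [1.0, 1.5, 1.0]   # zero-shot
z_de = [0.5, 1.0, 0.0]   # logits of the pass being steered (assumed meaning)

v = task_vector(z_icl, z_zs)
steered = steer_logits(z_de, v, lam=1.0)

print(v)                               # [1.0, -1.0, 0.0]
print(steered)                         # [1.5, 0.0, 0.0]
print(argmax(z_de), argmax(steered))   # 1 0
```

In this example the demonstrations raise token 0 and lower token 1, so adding the task vector flips the greedy choice from token 1 to token 0 without any parameter update.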

🖼️ Image Restoration¶

CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit: CreditDecoding is a training-free parallel decoding acceleration method that accumulates token-level historical evidence (trace credit) to boost correct but low-confidence tokens, achieving up to 5.48x speedup with +0.48 accuracy gain on LLaDA-8B-Instruct.
Diffusion-CAM: Faithful Visual Explanations for dMLLMs: Diffusion-CAM is the first interpretability method for diffusion-based multimodal LLMs (dMLLMs), extracting structurally valid intermediate representations from denoising trajectories with four post-processing modules (adaptive kernel denoising, distribution-aware confidence gating, contextual background decay, single-instance causal debiasing), significantly outperforming autoregressive CAM baselines on COCO Caption and GranDf.
Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation: EviOmni learns to extract rational evidence from retrieved documents through a "reason-then-extract" paradigm: integrating evidence reasoning and extraction into a unified trajectory, using knowledge token masking to prevent information leakage, and optimizing via GRPO with verifiable rewards, achieving accuracy surpassing full-text retrieval at ~38x compression across 5 benchmarks.
Lost in Diffusion: Uncovering Hallucination Patterns and Failure Modes in Diffusion Large Language Models: This work presents the first systematic comparison of hallucination patterns between diffusion LLMs (dLLMs) and their autoregressive (AR) counterparts, revealing that current dLLMs exhibit higher hallucination tendency and identifying three diffusion-specific failure modes: premature termination, incomplete denoising, and context intrusion.
Purging the Gray Zone: Latent-Geometric Denoising for Precise Knowledge Boundary Awareness: GeoDe trains linear probes in LLM latent space to construct truth hyperplanes, using sample-to-hyperplane geometric distance as confidence signals to filter high-quality abstention fine-tuning data, effectively eliminating "gray zone" noise near decision boundaries and significantly improving model truthfulness and reliability.
Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning: This paper systematically analyzes sources and amplification mechanisms of spurious signals in test-time RL (TTRL) — mid-frequency answers constitute the ambiguous zone as the primary noise source, while GRPO's within-group normalization amplifies these spurious signals — and proposes DDRL with balanced sampling, fixed advantage values, and consensus offline refinement, achieving 15.3% relative improvement on Qwen2.5-Math-1.5B.
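
The last entry above attributes part of the problem to GRPO's within-group normalization. The sketch below computes the standard group-normalized advantage, (reward minus group mean) divided by group standard deviation, on two invented reward groups. Using the population standard deviation is an assumption, and the example only shows one way normalization can enlarge a weakly supported signal. It does not reproduce the paper's analysis or DDRL.

```python
from statistics import mean, pstdev

def group_normalized_advantages(rewards):
    """Per-sample advantage: (r - group mean) / group std (population std assumed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts received the same reward, so there is no signal.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Invented groups of 8 rollouts; reward 1 means "matches the pseudo-label".
weak_consensus = [1, 1, 1, 0, 0, 0, 0, 0]    # only 3 of 8 agree
strong_consensus = [1, 1, 1, 1, 1, 1, 1, 0]  # 7 of 8 agree

weak_adv = group_normalized_advantages(weak_consensus)
strong_adv = group_normalized_advantages(strong_consensus)

# The rewarded rollouts in the weak-consensus group get the larger advantage.
print(round(weak_adv[0], 3), round(strong_adv[0], 3))
```

A fixed advantage value, as mentioned in the entry, would give rewarded rollouts the same weight in both groups instead of scaling it by the group statistics.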

📚 Pretraining¶

Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding: This paper proposes an automated method for augmenting existing commonsense knowledge bases with negation, constructing a large-scale negated commonsense corpus (¬Atomic and ¬Anion) containing over 2 million triples, and demonstrates that pretraining on this corpus improves LLMs' negation understanding capabilities.
Compact Example-Based Explanations for Language Models: This paper proposes the Selection Relevance Score, a retraining-free metric for evaluating the quality of training sample subsets as example-based explanations. It demonstrates that the commonly used "select top-k by influence" strategy frequently underperforms random selection, and introduces a new strategy that balances influence and representativeness.
SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization: This paper proposes the SAGE optimizer, which addresses the "embedding layer dilemma" of lightweight optimizers by combining Lion-style sign update directions with an \(O(d)\)-memory adaptive damping scaling factor \(\mathbf{H}_t\). SAGE achieves new state-of-the-art perplexity on Llama models (up to 1.3B parameters) with significantly reduced optimizer memory overhead.
SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models: This paper proposes SCRIPT, a model-agnostic plug-and-play module that injects subcharacter (Jamo) compositional knowledge from the Korean Hangul writing system into the embedding layer of existing subword-level PLMs via a dual-channel strategy. Without requiring re-pretraining, SCRIPT yields consistent improvements on Korean NLU/NLG tasks and enables the embedding space to better capture morphosyntactic regularities and semantic variations.
Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity: This paper integrates human working memory constraints (fixed window, exponential decay, logistic decay, and primacy-recency effects) into the GPT-2 attention mechanism, training from scratch on developmentally plausible small-scale corpora (10M/100M words). The results demonstrate that these constraints significantly improve grammatical accuracy and human reading time predictability under data scarcity, while also promoting functional specialization of attention heads.

🎯 Object Detection¶

E2E-GMNER: End-to-End Generative Grounded Multimodal Named Entity Recognition: This paper proposes E2E-GMNER, the first end-to-end GMNER framework that unifies entity recognition, semantic classification, visual grounding, and implicit knowledge reasoning within a single multimodal large language model. The framework employs CoT reasoning to adaptively assess the utility of visual and knowledge cues, and introduces Gaussian Risk-aware Bounding box Perturbation (GRBP) to enhance the robustness of generative bounding box prediction.
Evaluating Memory Capability in Continuous Lifelog Scenario: This paper proposes LifeDialBench, a benchmark for evaluating memory capabilities in continuous lifelog scenarios, comprising EgoMem (7 days of real-world data) and LifeMem (1 year of simulated data). An online evaluation protocol is introduced to enforce temporal causality. Counterintuitively, a simple RAG baseline consistently outperforms all complex memory systems.
Evolutionary Negative Module Pruning for Better LoRA Merging: This paper proposes ENMP, a method that leverages evolutionary search to identify and prune "negative modules" that degrade performance during LoRA merging. Designed as a plug-and-play enhancement, ENMP consistently improves existing merging algorithms across both NLP and vision domains.
GigaCheck: Detecting LLM-generated Content via Object-Centric Span Localization: GigaCheck is proposed as a dual-strategy framework: document-level classification via fine-tuned LLM, and span-level detection that innovatively treats AI-generated text spans as "objects," employing a DETR-like architecture for end-to-end character-level localization.
Retrievals Can Be Detrimental: Unveiling the Backdoor Vulnerability of Retrieval-Augmented Diffusion Models: This paper proposes BadRDM, the first backdoor attack framework targeting retrieval-augmented diffusion models (RDMs). By maliciously fine-tuning the retriever via contrastive learning, it establishes a shortcut from trigger tokens to toxic proxy images, achieving attack success rates of 90.9% and 96.4% on class-conditional and text-to-image (T2I) tasks respectively, while preserving benign generation quality.

📈 Time Series¶

A Unified Framework for Modeling Heterogeneous Financial Data via Dual-Granularity Prompting: This paper proposes FinLangNet, a dual-module framework comprising DeepFM for static feature processing and a Transformer with a dual-granularity prompting mechanism for sequential behavior modeling, enabling multi-scale credit risk prediction. Upon deployment on the Didi Finance platform, the system achieves a 6.3 pp improvement in KS and a 9.9% reduction in bad debt rate.
Learning Uncertainty from Sequential Internal Dispersion in Large Language Models: This paper proposes the SIVR framework, which computes internal variance statistics (generalized variance, circular variance, and token entropy) across layers of LLM hidden states as token-level features, and aggregates full-sequence patterns via a lightweight Transformer encoder to estimate uncertainty and detect hallucinations, achieving significant improvements over baselines with stronger generalization.
STK-Adapter: Incorporating Evolving Graph and Event Chain for Temporal Knowledge Graph Extrapolation: This paper proposes STK-Adapter, which embeds three MoE modules at every layer of an LLM—ST-MoE for capturing spatiotemporal structure, EA-MoE for modeling event chain semantics, and CMA-MoE for deep cross-modal alignment—to address the spatiotemporal information loss and layer-wise dilution caused by shallow alignment between TKG embeddings and LLMs in existing methods, achieving significant improvements over SOTA on four benchmark datasets.
Temporal Leakage in Search-Engine Date-Filtered Web Retrieval: A Retrospective Forecasting Case Study: This paper presents a systematic audit of date filters in Google and DuckDuckGo, revealing that search engine date filtering critically fails in retrospective forecasting evaluation — 71% (Google) and 81% (DuckDuckGo) of questions have at least one page containing significant post-cutoff information leakage, causing the prediction Brier score to drop artificially from 0.24 to 0.10.
Time-RA: Towards Time Series Reasoning for Anomaly Diagnosis with LLM Feedback: This paper introduces Time-RA, a new task that upgrades time series anomaly detection from binary classification to generative reasoning diagnosis (detection + classification + causal explanation). It constructs RATs40K, the first multimodal benchmark comprising ~40K samples across 10 domains and 20 anomaly types, and validates the feasibility of this paradigm through an AI feedback annotation pipeline and LLM fine-tuning.
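The temporal-leakage note above reports forecasting quality as a Brier score, where lower is better. As a minimal sketch of how that metric is computed for binary questions (the function name and the toy forecasts below are illustrative assumptions, not data from the paper):

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical toy forecasts: an uninformed forecaster versus one that has
# seen post-cutoff pages and therefore answers with unwarranted confidence.
outcomes = [1, 0]
uninformed = brier_score([0.5, 0.5], outcomes)  # 0.25
leaked = brier_score([0.9, 0.1], outcomes)      # roughly 0.01
```

A score that falls only because the retrieved pages already contain the answer says nothing about forecasting skill, which is why the audit treats the drop as artificial.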

✏️ Knowledge Editing¶

Aligning Language Models with Real-time Knowledge Editing: This paper introduces CRAFT (a continuously updated Chinese financial knowledge editing dataset) and KEDAS (a knowledge editing alignment paradigm based on diverse edit augmentation and self-adaptive inference), addressing the problem that existing knowledge editing methods cannot simultaneously achieve high editing success rate, locality, and portability in real-time scenarios.
CLaRE-ty Amid Chaos: Quantifying Representational Entanglement to Predict Ripple Effects in LLM Editing: CLaRE is a lightweight representation-level method that quantifies the entanglement between facts through forward activations of a single intermediate layer to predict ripple effects of model editing, achieving a 62.2% average improvement in Spearman correlation over gradient methods while being 2.74× faster and requiring 2.85× less memory.
EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing: This paper proposes EvoEdit, which achieves large-scale sequential knowledge editing through dynamically evolving null-space projectors. It efficiently injects new knowledge while preserving existing knowledge, maintaining SOTA performance at the 10K edit scale while being 3.5× faster than AlphaEdit.
FABLE: Fine-grained Fact Anchoring for Unstructured Model Editing: This paper discovers that existing unstructured model editing methods can holistically recall edited text but cannot perform fine-grained fact access, and proposes FABLE, a framework that uses a two-stage hierarchical strategy to anchor fine-grained facts in shallow layers and integrate holistic narratives in deep layers, along with the UnFine diagnostic benchmark for systematic evaluation.

✂️ Segmentation¶

AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation: This paper proposes AnchorSeg, which reformulates reasoning segmentation as a structured conditional generation process based on language-grounded query banks. By explicitly decoupling spatial localization and semantic reasoning through anchor queries, paired with a Token-Mask Cyclic Consistency training objective, it achieves SOTA on ReasonSeg (67.7% gIoU, 68.1% cIoU).
BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation: BoundRL redefines structured text segmentation as a boundary generation task — generating only each segment's start tokens rather than the complete text, reducing output tokens by 90% and eliminating hallucination risk. Combined with a dual-objective reward function and selective perturbation strategy for RLVR training, a 1.7B model surpasses Claude-4 Sonnet's few-shot performance.
Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech: This paper proposes Hierarchical Policy Optimization (HPO), which post-trains LLM-based simultaneous speech translation models with a hierarchical reward design that suppresses latency optimization when translation quality falls below a threshold, achieving a +7 COMET improvement in translation quality at 1.5-second latency.
TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos: This paper proposes TemporalVLM, which extracts local fine-grained temporal features through a time-aware segment encoder (overlapping sliding Video Q-Former + fusion module), then aggregates global long-range dependencies using BiLSTM. This is the first work to introduce LSTM into Video LLMs, outperforming prior methods on four tasks: dense video captioning, temporal grounding, highlight detection, and action segmentation.

🛡️ AI Safety¶

When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation: This paper constructs FairNews, the first multi-document news summarization dataset with political leaning labels, and evaluates 13 LLMs through a five-dimensional fairness evaluation framework, finding that mid-sized models outperform larger ones in fairness and efficiency, and that entity sentiment similarity is the dimension most resistant to prompt-based debiasing.
XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection: This paper proposes the XLSR-MamBo framework, systematically exploring four topology designs and multiple SSM variants (Mamba2, Hydra, GDN) for Mamba-Attention hybrid architectures in audio deepfake detection, where MamBo-3-Hydra achieves competitive performance across multiple benchmarks through Hydra's native bidirectional modeling, and increasing backbone depth effectively mitigates shallow model instability.
XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics: This paper constructs XQ-MEval, the first translation evaluation benchmark with cross-lingual parallel quality, using semi-automated MQM error injection to generate pseudo-translations with controllable quality, empirically revealing cross-lingual scoring bias in automatic evaluation metrics for the first time, and proposing an LGN normalization strategy that effectively calibrates multilingual metric evaluation.

📡 Signal & Communications¶

PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models: This paper proposes PolicyBench (a 21K-question cross-system policy understanding benchmark spanning China and the US) and PolicyMoE (a cognitive-level-aligned Mixture of Experts model), systematically evaluating 11 SOTA LLMs across memory/understanding/application cognitive levels and finding that models perform well on structured reasoning but remain weak on abstract policy concepts.
Solver-Independent Automated Problem Formulation via LLMs for High-Cost Simulation-Driven Design: This paper proposes APF (Automated Problem Formulation), a solver-independent framework that uses LLMs to translate engineers' natural language design requirements into executable mathematical optimization models. Through innovative data generation and test instance annotation pipelines, APF overcomes the difficulty of using solver feedback for data filtering in high-cost simulation scenarios, significantly outperforming existing methods on antenna design tasks.
UCS: Estimating Unseen Coverage for Improved In-Context Learning: This paper proposes UCS (Unseen Coverage Selection), a training-free subset-level coverage prior based on the Smoothed Good-Turing estimator that regularizes existing ICL example selection methods by estimating the number of unobserved latent clusters in candidate example sets, improving accuracy by 2-6% on intent classification and reasoning tasks.
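The UCS note above builds on Good-Turing estimation of what a candidate set has not yet covered. As a rough illustration of the underlying idea only, the sketch below computes the classic Good-Turing missing-mass estimate (the share of items whose cluster appears exactly once). It is not the paper's smoothed estimator or its selection procedure, and the cluster labels are hypothetical.

```python
from collections import Counter


def good_turing_unseen_mass(cluster_labels):
    """Classic Good-Turing estimate of the probability mass of unseen clusters.

    Returns N1 / N, where N1 is the number of clusters observed exactly once
    and N is the total number of observed examples.
    """
    n = len(cluster_labels)
    if n == 0:
        raise ValueError("need at least one observation")
    counts = Counter(cluster_labels)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


# Hypothetical latent-cluster assignments for seven candidate examples.
labels = ["a", "a", "b", "c", "c", "c", "d"]
unseen_mass = good_turing_unseen_mass(labels)  # 2 singletons out of 7 examples
```

A candidate set with many singleton clusters suggests that much of the latent space is still unobserved, which is the kind of signal a coverage prior can use to regularize example selection.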

🎬 Video Generation¶

Accelerating Training of Autoregressive Video Generation Models via Local Optimization with Representation Continuity: This paper proposes a Local Optimization + Representation Continuity (ReCo) training strategy that optimizes within local windows while constraining smooth transitions of hidden states, achieving 2× training speedup for autoregressive video generation models without sacrificing generation quality.
OSCBench: Benchmarking Object State Change in Text-to-Video Generation: This paper proposes OSCBench — the first benchmark dedicated to evaluating object state change (OSC) capabilities in text-to-video (T2V) models. Built on cooking scenarios with 1,120 prompts covering conventional/novel/compositional scenarios, it reveals that even the strongest T2V model achieves only 0.786 OSC accuracy.
Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement: This paper proposes VideoRepair, the first training-free, model-agnostic text-to-video self-correction framework that detects fine-grained text-video misalignment via MLLM, preserves correct regions, and selectively repairs problematic regions, consistently improving alignment quality across four T2V backbone models on EvalCrafter and T2V-CompBench.

🔄 Self-Supervised Learning¶

[b] = [d] − [t] + [p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic: This paper systematically demonstrates that linear phonological feature vectors exist in the representation space of self-supervised speech models (S3M), satisfying word2vec-like vector arithmetic relations, with their scaling factors continuously correlating with acoustic measurements.
ConlangCrafter: Constructing Languages with a Multi-Hop LLM Pipeline: This paper proposes ConlangCrafter, a multi-hop LLM pipeline that decomposes constructed language (conlang) design into three modular stages — phonology, grammar, and lexicon — ensuring typological diversity through randomness injection and internal consistency through self-refinement loops, along with an automatic evaluation framework incorporating typological diversity analysis and translation consistency assessment.

🧑 Human Understanding¶

ForgeryTalker: Generating Attribution Reports for Manipulated Facial Images: The paper proposes the Forgery Attribution Report Generation task, constructs the MMTT dataset with 152,217 samples (the first large-scale facial forgery dataset providing both pixel-level masks and human-written text descriptions), and introduces the ForgeryTalker end-to-end baseline that jointly generates localization masks and attribution reports via a shared encoder and dual decoders (mask + language model), achieving 59.3 CIDEr and 73.67 IoU.

📐 Optimization & Theory¶

CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning: The paper proposes CLewR (Curriculum Learning with Restarts), a strategy that sorts data from easy to hard and restarts the curriculum at each epoch during preference optimization training, effectively mitigating catastrophic forgetting and consistently improving machine translation performance across multiple model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization algorithms (DPO, CPO, ARPO).

🛰️ Remote Sensing¶

MONETA: Multimodal Industry Classification through Geographic Information with Multi Agent Systems: The paper proposes MONETA, the first multimodal industry classification benchmark combining text (websites, Wikipedia, Wikidata) and geospatial data (OpenStreetMap, satellite imagery). Training-free zero-shot and multi-turn multi-agent pipelines built on open-source and proprietary MLLMs achieve 62.10%-74.10% accuracy on 20-class NACE industry classification, with the multi-turn design improving results by up to 22.80%.

📂 Others¶

Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations: This paper extends the LiTEx reasoning taxonomy from "label-consistent, explanation-variant" settings to label-disagreement scenarios, finding that annotators may share similar reasoning strategies despite assigning different labels, and that reasoning category agreement better reflects the semantic similarity of explanations than label agreement alone.
Are Large Language Models Economically Viable for Industry Deployment?: This paper proposes Edge-Eval, a framework that evaluates LLMs across their full deployment lifecycle on legacy T4 GPUs using five deployment metrics—economic break-even, intelligence-per-watt, system density, cold-start tax, and quantization fidelity. The framework reveals that sub-2B models comprehensively outperform 7B models on both economic and ecological dimensions, and uncovers the counterintuitive finding that QLoRA, while reducing memory by ~60%, can increase energy consumption by up to 7×.
Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning: This paper proposes PTE (Prefill Token Equivalents), a hardware-aware efficiency metric for tool-integrated reasoning (TIR) that unifies the costs of internal reasoning and external tool use. Through large-scale experiments, the paper identifies four inefficiency patterns in TIR: confirmatory tool use, tool mixing, lack of tool priors, and tool format collapse.
Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation: This paper deploys LLM agents (DeepSeek/GPT series) in the classic beer distribution game to simulate multi-stage supply chains, systematically investigating how cognitive heterogeneity (differences in reasoning capability) affects system behavior. The findings demonstrate that LLM agents can reproduce human-observed bullwhip effects and myopic behaviors, and that information sharing effectively mitigates these adverse effects.
Reliable Evaluation Protocol for Low-Precision Retrieval: This paper identifies that low-precision retrieval systems (e.g., binarized or quantized embeddings) suffer from a large number of spurious ties due to reduced score granularity, leading to highly unstable evaluation results. Two complementary strategies are proposed—High-Precision Scoring (HPS) and Tie-aware Retrieval Metrics (TRM)—to enable more reliable and consistent evaluation of low-precision retrieval systems.
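The low-precision retrieval note above attributes unstable evaluation to spurious score ties. One generic way to make a metric tie-aware is to take its expectation over all orderings of each tied group. The sketch below does this for precision@k. It is an illustrative assumption about what a tie-aware metric can look like, not the paper's TRM definition, and the function name and inputs are hypothetical.

```python
from itertools import groupby


def expected_precision_at_k(scores, relevant, k):
    """Expected precision@k when documents with equal scores are ordered at random."""
    if k <= 0 or k > len(scores):
        raise ValueError("k must be between 1 and the number of documents")
    # Sort by descending score so that tied documents become adjacent.
    ranked = sorted(zip(scores, relevant), key=lambda pair: -pair[0])
    expected_hits = 0.0
    slots_left = k
    for _, group in groupby(ranked, key=lambda pair: pair[0]):
        group = list(group)
        n_relevant = sum(1 for _, is_rel in group if is_rel)
        taken = min(slots_left, len(group))
        # Each document in a tied group is equally likely to fill a taken slot.
        expected_hits += n_relevant * taken / len(group)
        slots_left -= taken
        if slots_left == 0:
            break
    return expected_hits / k


# Three documents tie at the top after quantization; only one is relevant.
tied = expected_precision_at_k([1, 1, 1, 0], [True, False, False, True], k=1)
```

With arbitrary tie-breaking, precision@1 in this example would be either 0 or 1 depending on the sort order, while the expectation is a single reproducible value of one third.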