💬 ACL2026 Accepted Papers

1419 ACL2026 paper notes covering LLM Safety (115), LLM Evaluation (97), Multimodal VLM (83), LLM Agent (82), LLM Reasoning (82), Information Retrieval & RAG (73), Audio & Speech (72), Multilingual & Translation (64), and 38 other areas. Each note has a TL;DR, motivation, method, experiments, highlights, and limitations: a 5-minute read of the core ideas.


💡 LLM Reasoning (82)

Accurate Legal Reasoning at Scale: Neuro-Symbolic Offloading and Structural Auditability for Robust Legal Adjudication

This paper proposes the Amortized Intelligence paradigm: treating the LLM as a "one-time compiler" to compile legal contracts into a deterministic Directed Acyclic Graph (DAG) intermediate representation called DACL. At runtime, a lightweight agent schedules a symbolic engine for execution, achieving 99.5% accuracy across 400 real-world contract events. Compared to large reasoning models like GPT-5.2/Claude/Gemini, accuracy on complex contracts jumps from 22-46% to 98%, while token consumption is reduced by 9.9x.

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

This paper proposes APMPO, which unifies GRPO (arithmetic mean) and GMPO (geometric mean) objectives using a "power-mean" controlled by the current mean reward. In conjunction with an adaptive clip range based on reward stability, APMPO allows RLVR training to dynamically switch between "amplifying rare high rewards" and "emphasizing consistency" across different stages, consistently outperforming GRPO, DAPO, and GMPO on 9 mathematical, SQL, and multimodal reasoning benchmarks.
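
As a rough illustration of how a single power mean can interpolate between an arithmetic mean (the GRPO-style case) and a geometric mean (the GMPO-style case), here is a minimal sketch. The function name `power_mean`, the example values, and the exponents are assumptions made for illustration; the paper's actual objective, and its rule for setting the exponent from the current mean reward, are not reproduced here.

```python
import math

def power_mean(values, p):
    """Power mean M_p of strictly positive values.

    p = 1 gives the arithmetic mean; p -> 0 is handled as the geometric mean.
    """
    if any(v <= 0 for v in values):
        raise ValueError("this sketch assumes strictly positive values")
    n = len(values)
    if abs(p) < 1e-12:
        # Limit p -> 0: exp of the mean log, i.e. the geometric mean.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)

# Hypothetical per-token ratios, chosen only to make the contrast visible.
ratios = [0.5, 1.0, 2.0]
arithmetic = power_mean(ratios, 1.0)   # pulled up by the large value
geometric = power_mean(ratios, 0.0)    # damps the large value
in_between = power_mean(ratios, 0.5)   # intermediate exponent
```

Lowering the exponent toward 0 reduces the influence of rare large values, which is one way to read the "amplify rare high rewards" versus "emphasize consistency" trade-off described above.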

AIM-CoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning

The AIM-CoT framework is proposed to address two core issues in Interleaved Multimodal Chain-of-Thought (I-MCoT)—"what to see" and "when to see"—through Information Foraging Theory-driven Active Visual Probing (AVP) and an attention-shift-based Dynamic Attention-shift Trigger (DAT).

Budget-Aware Anytime Reasoning with LLM-Synthesized Preference Data

This paper proposes a budget-aware anytime reasoning framework and an Anytime Index metric to quantify the quality-efficiency trade-off of LLMs under limited token budgets. It also designs a reasoning-time self-improvement method (PDP) based on LLM-synthesized preference data, significantly improving the quality of intermediate and final solutions across planning, mathematics, and science QA tasks.

C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

Addressing the double-edged sword where "self-generated rubrics often mislead reward models," the authors use language model (LM) likelihood margins to automatically label 16 self-sampled rubrics as "helpful/misleading" pairs. They then train a cooperative rubric generator via DPO and a "critical" verifier via GRPO, which assesses rubric reliability before making judgments. Using only binary preference data, C2 improves reasoning RM performance by up to 6.5 points on RM-Bench and increases downstream DPO LC win rates by 6 points. Notably, an 8B model using self-generated rubrics matches the performance obtained with rubrics from a \(4\times\) larger model (Qwen3-32B).

Calibration-Aware Policy Optimization for Reasoning LLMs

The authors first prove that the "reward-only" advantage estimation in GRPO-like algorithms is equivalent to an AUC-inconsistent surrogate (\(\phi(t)=-t\), violating scale-invariance), which leads to a continuous degradation of relative calibration (perplexity AUC) even as accuracy increases. Accordingly, they propose CAPO: replacing the advantage with a "pairwise, uncertainty-aware" form based on a logistic AUC-consistent surrogate, further enhanced by denoising masking using reference-model PPL. On Qwen2.5-Math 1.5B/7B, CAPO achieves 15–25% calibration improvements with comparable or superior accuracy to GRPO, and an additional 5% gain in AIME inference-time scaling.

Can Reasoning Path still be Effective as Input? Bridging Post-Reasoning to Chain-of-Thought Compression

This paper proposes post-reasoning and UCoT: a lightweight compressor first generates soft tokens representing the reasoning path via a single forward pass, and then an executor uses these soft tokens as input context to perform short-output reasoning, significantly reducing CoT tokens and latency while maintaining reasoning accuracy.

Chain-of-Thought as a Lens: Evaluating Structured Reasoning Alignment between Human Preferences and Large Language Models

This paper proposes Alignment Score—a semantic-level metric based on a semantic entropy matrix—to quantify reasoning alignment by comparing intermediate steps of model-generated chains-of-thought with human-preferred reference chains. The study finds that Alignment Score is highly correlated with task accuracy, readability, and coherence, identifying 2-hop reasoning as the peak depth for alignment.

ChAIRO: Contextual Hierarchical Analogical Induction and Reasoning Optimization for LLMs

The authors propose ChAIRO, a framework for contextual hierarchical analogical induction and reasoning optimization. Through a three-stage pipeline (analogical case generation → rule induction → rule-injected fine-tuning), the framework enables LLMs to autonomously generate analogical cases and induce explicit moderation rules for content moderation. It achieves a 4.5% \(F1\) improvement over single-instance rule generation and a 2.3% improvement over static RAG.

CoAct: Co-Active LLM Preference Learning with Human-AI Synergy

CoAct utilizes self-consistency during preference alignment to partition unlabeled samples into "high-consistency" and "low-consistency" sets. It then employs k-NN distance to identify "self-consistent yet potentially incorrect" risky samples from the high-consistency set for Oracle labeling, while the remaining high-consistency samples are treated as AI self-labeled data. Finally, Oracle-verified samples are used as in-context demos to generate new instructions. By integrating human and AI supervision into a single DPO loop, CoAct achieves a 4–8 percentage point improvement over state-of-the-art baselines on GSM8K, MATH, and WebInstruct.
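
The first step described above, splitting unlabeled samples by self-consistency, can be sketched as follows. This is a simplified illustration, not the authors' implementation: the names `consistency` and `partition`, the 0.8 threshold, and the toy data are all assumptions, and the k-NN risk filtering and Oracle labeling steps are omitted.

```python
from collections import Counter

def consistency(answers):
    """Return (majority answer, fraction of sampled answers that agree with it)."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

def partition(samples, threshold=0.8):
    """Split prompts into high- and low-consistency sets by majority agreement.

    `threshold` is a hypothetical cutoff chosen for this sketch.
    """
    high, low = [], []
    for prompt, answers in samples.items():
        label, agreement = consistency(answers)
        target = high if agreement >= threshold else low
        target.append((prompt, label, agreement))
    return high, low

# Toy data: five sampled answers per prompt.
samples = {
    "q1": ["4", "4", "4", "4", "4"],
    "q2": ["7", "9", "7", "3", "8"],
}
high_set, low_set = partition(samples)
```

In this toy run `q1` lands in the high-consistency set with its majority answer as a self-label, while `q2` falls into the low-consistency set.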

Browse all 82 LLM Reasoning papers →


🦾 LLM Agent (82)

AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning

This paper identifies that "LLM-as-Judge + Fixed Rubrics" (Helpfulness/Safety/Fluency) are poorly matched for evaluating goal-oriented agent trajectories. It proposes AdaRubric—where an LLM automatically generates task-specific N-dimensional evaluation rubrics based on task descriptions, followed by confidence-weighted step-by-step evaluations to produce dense reward signals. A DimensionAwareFilter is designed for DPO data construction to prevent "dimension masking." Evaluated on WebArena/ToolBench/AgentBench, it achieves a Pearson \(r=0.79\) and brings a \(+6.8\) to \(+8.5\%\) task success rate improvement through DPO training.

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

AgencyBench is proposed as a comprehensive benchmark comprising 138 real-world tasks to evaluate 6 core agent capabilities. Each scenario averages 90 tool calls and 1 million tokens, achieving fully automated evaluation via user simulation agents and Docker sandboxes.

Agent-GWO: Collaborative Agents for Dynamic Prompt Optimization in Large Language Models

This paper proposes Agent-GWO, which introduces the leader-follower mechanism of the Grey Wolf Optimizer into a multi-agent framework to jointly optimize prompt templates and decoding hyperparameters (temperature, top-p, etc.). It consistently outperforms existing prompt optimization methods across 11 mathematical and hybrid reasoning benchmarks.

AnchorMem: Anchored Facts with Associative Contexts for Building Memory in Large Language Models

The AnchorMem memory framework is proposed, inspired by the Proust phenomenon. It decouples the retrieval unit (atomic facts) from the generation context (original interactions) and connects fragmented memories via an associative event graph. It significantly outperforms existing systems like A-Mem and Mem0 on the LoCoMo benchmark.

AVA: Attentive VLM Agent for Mastering StarCraft II

This paper proposes AVACraft—the first StarCraft II multimodal benchmark supporting both MARL and VLM decision-making paradigms (21 scenarios / RGB + Text + Structured State). It introduces the VLM baseline AVA (Multimodal Priority Reasoning + RAG + Dynamic Role Assignment). Experiments demonstrate that while MARL achieves only a 19–27% win rate after 5M training steps in base 3m scenarios, zero-shot VLM reaches 75–90%.

BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search

Addressing the reliability issue where RL-trained agentic search models rarely say "I DON'T KNOW," leading to hallucinations, BAPO introduces "group-based boundary-aware rewards + adaptive reward modulators" on top of GRPO. This allows the model to decline to answer only when a question truly exceeds its boundaries. Compared to GRPO, BAPO improves reliability across four multi-hop QA datasets by approximately 9.7% on average and outperforms Search-R1 (trained on 90k samples) using only 5k training samples.

Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces

The authors developed WebDecept—a lightweight, pluggable "deceptive interface injection layer" that can insert seven types of common real-world deceptive patterns (pop-ups, banners, domain redirection, hidden cart additions, price changes, etc.) into the VisualWebArena e-commerce environment at specific trigger times to test the safety of multimodal web agents. The results show that advanced agents like GPT-5.1, Claude 4.5, and Gemini 2.5 are generally vulnerable, particularly to "hidden cart/total price manipulation," where they almost entirely failed, and safety prompts were unable to mitigate these risks.

ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

ChartAgent transforms chart question answering from "textual chain-of-thought" to "acting on the image itself." By using a suite of chart-specific visual tools (segmenting pie slices, isolating bars, locating axes) within a ReAct loop and performing self-verification on intermediate visualizations, it achieves gains of up to 16.07% on ChartBench / ChartX for unannotated and numerical-heavy challenges, with a 17.31% improvement on the unannotated subset.

CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents

This paper proposes CLAG, a cluster-based Agent memory framework. It organizes memories into semantically consistent clusters via SLM-driven routing, performs local evolutionary updates within clusters, and filters noise through two-stage retrieval. It significantly outperforms global memory pool baselines across multiple QA datasets.

CodeStruct: Code Agents over Structured Action Spaces

This paper proposes the CodeStruct framework, which redefines code repositories as AST-based structured action spaces. It enables LLM code agents to perform read and edit operations through named program entities (rather than text snippets), achieving a 1.2–5.0% accuracy improvement on SWE-Bench Verified while reducing token consumption by 12–38%.
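
To make the idea of addressing code by named program entity concrete, here is a minimal sketch using Python's standard `ast` module. It is not CodeStruct's action space: the helper `read_entity`, the dotted-name convention, and the sample source are assumptions for illustration, and only a read operation is shown.

```python
import ast

SOURCE = '''\
def add(a, b):
    return a + b

class Greeter:
    def greet(self, name):
        return "hi " + name
'''

def read_entity(source, qualified_name):
    """Return the source text of a function or class addressed by a dotted name."""
    node = ast.parse(source)
    definitions = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    for part in qualified_name.split("."):
        # Descend one level: look for a definition with this name in the body.
        for child in node.body:
            if isinstance(child, definitions) and child.name == part:
                node = child
                break
        else:
            raise KeyError(qualified_name)
    return ast.get_source_segment(source, node)

# Read one method by name instead of by line range or text search.
snippet = read_entity(SOURCE, "Greeter.greet")
```

Because the lookup goes through the syntax tree, the agent receives exactly one definition regardless of where it sits in the file, which is the kind of saving in context that the token-reduction figures above point to.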

Browse all 82 LLM Agent papers →


👥 Multi-Agent (40)

A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation

This paper proposes MAFIG, a framework that leverages multi-agent collaboration, feature-level evaluators, and iterative revision to generate multiple-choice reading comprehension questions. Compared to single-turn prompting, it significantly improves the satisfaction rate of constraints such as vocabulary, passage length, sentence length, reasoning complexity, factuality, and option neutrality, while providing a more stable monotonic increase in difficulty.

AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models

AgenticEval redefines LLM safety evaluation as a "continuous, self-evolving red-teaming process": the Specialist decomposes unstructured regulatory text into an atomic rule knowledge base; the Generator creates multimodal and multi-format Question Groups centered around each rule; the Evaluator + Analyst continuously transform failures from the current round into more aggressive attack strategies for the next. After three iterations, the compliance rate of GPT-5 under the EU AI Act plummeted from 72.50% to 36.36%, revealing that static benchmarks significantly overestimate the safety levels of large models.

ATLAS: Adaptive Trading with LLM AgentS Through Dynamic Prompt Optimization and Multi-Agent Coordination

This paper proposes the ATLAS multi-agent financial trading framework and the Adaptive-OPRO prompt optimization method. By utilizing specialized analyst agents to prepare heterogeneous market information and dynamically optimizing the instruction prompts of the central trading agent based on delayed noisy feedback, the system significantly outperforms baselines across diverse volatile market environments.

AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage

AutoReproduce proposes a multi-agent framework that utilizes a "Paper Lineage" algorithm to mine implicit domain knowledge from referenced literature. This enables end-to-end automatic reproduction of paper experiments, achieving a code execution rate of 94.87% and a performance gap of only 19.72% on the self-constructed ReproduceBench.

BookAgent: Orchestrating Safety-Aware Visual Narratives via Multi-Agent Cognitive Calibration

BookAgent is a safety-aware multi-agent framework that utilizes a three-stage closed-loop architecture consisting of a Value-Aligned Storyboard (VAS) + Iterative Cross-modal Refinement (ICR) + Temporal Cognitive Calibration (TCC) to generate high-quality, character-consistent, and safety-compliant picture book stories end-to-end from user drafts.

CIA: Inferring the Communication Topology from LLM-based Multi-Agent Systems

This paper proposes CIA (Communication Inference Attack), which, under a strict black-box setting where only the final output is observable, induces multi-agent systems to expose intermediate agent reasoning through adversarial queries. By combining global bias disentanglement with LLM weak supervision to model semantic correlations, it successfully reconstructs the MAS communication topology, achieving an average AUC of 0.87 and a peak of 0.99.

Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games

This paper proposes a collaborative multi-agent framework for the automated generation of high-quality Murder Mystery game scripts and training data. A two-stage training strategy (CoT fine-tuning + GRPO reinforcement learning with ScoreAgent reward shaping) enhances the multi-hop reasoning capability of VLMs under imperfect information. This significantly improves VLM narrative reasoning, fact extraction, and deception resistance on WhodunitBench.

Conjunctive Prompt Attacks in Multi-Agent LLM Systems

This paper investigates conjunctive prompt attacks in multi-agent LLM systems: trigger keys embedded in user queries and hidden templates in compromised remote agents appear harmless individually, but activate harmful behavior when routing brings them to the same agent. Existing defenses (PromptGuard, Llama-Guard, etc.) cannot reliably prevent these attacks.

ConSensus: Multi-Agent Collaboration for Multimodal Sensing

ConSensus is a training-free multi-agent sensor fusion framework that assigns specialized agents to independently interpret different sensing modalities. By utilizing semantic fusion, statistical consensus, and hybrid arbitration, it achieves an average 7.1% accuracy improvement over single-agent methods across five multimodal sensing benchmarks, while reducing fusion token costs to approximately 1/12.7 of multi-round debate methods.

Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection

This paper proposes the RADAR framework, which detects half-truths based on omitted context through role-anchored (Politician vs. Scientist) multi-agent debate. Combined with a dual-threshold adaptive early stopping mechanism, it consistently outperforms single-agent and traditional multi-agent baselines under noisy retrieval conditions.

Browse all 40 Multi-Agent papers →


⚖️ Alignment & RLHF (38)

AdaJudge: Adaptive Multi-Perspective Judging for Reward Modeling

To address two structural defects in reward models—the fixed spatial inductive bias and the misalignment between generative backbone representations and discriminative tasks caused by "compressing the entire sequence into a scalar via fixed pooling (e.g., last-token)"—AdaJudge proposes a gated refinement block to reshape representations into a discriminative space. It then utilizes "domain-aware gated multi-perspective pooling" to dynamically fuse evidence from last-token, mean, and attention poolings conditioned on the prompt. This approach allows 4B/8B models to outperform strong 27B off-the-shelf reward models on RM-Bench and JudgeBench.

AgentV-RL: Scaling Reward Modeling with Agentic Verifier

AgentV-RL reshapes the reward model from a "single-turn scoring" mechanism into a multi-turn deliberation process featuring "forward + backward dual agents + tool calls." Through SFT+GRPO, these multi-agent capabilities are distilled into a single 4B model, which outperforms 70B-scale ORMs by 25.2% in Best-of-N (BoN) selection.

Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling

Plan-RewardBench is proposed as a trajectory-level preference benchmark for complex tool-augmented scenarios, designed to evaluate the capability of reward models in distinguishing superior from inferior agent trajectories across multi-step planning, tool usage, and error recovery.

Alignment Data Map for Efficient Preference Data Selection and Diagnosis

This paper proposes the Alignment Data Map, an analytical tool that visualizes, selects, and diagnoses preference data by jointly considering response quality and variability. It achieves the alignment performance of full-set training using only 33% of the data.

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

ARES detects "systemic weaknesses" (simultaneous failure of both Core LLM and Reward Model) using a Safety Mentor that dynamically combines a quaternary structure of "Topic / Persona / Goal / Tactic." It subsequently employs a two-stage closed-loop process—repairing the RM before the policy—to raise the RedTeam safety rate from 0.28 to 0.96 with negligible loss in general capabilities.

BACH-V: Bridging Abstract and Concrete Human-Values in Large Language Models

This paper proposes an abstraction-grounding framework that decomposes the conceptual understanding of LLMs into three layers: "abstract-abstract, abstract-concrete, and concrete-concrete." Using concept probing and activation steering across 6 open-source LLMs and 10 value dimensions, the authors demonstrate that structured value representations exist within LLMs, migrate across abstraction layers, and causally drive concrete decisions.

Better Literary Translation: A Multi-Aspect Data Generation and LLM Training Approach

This paper decomposes literary translation quality into two dimensions: "expression fluency" and "literary effect." By using specialized LLMs to iteratively generate high-quality reference translations and preference pairs, the authors employ SFT + explicit Reward Model + GRPO to train LitMT. This allows 8B/14B small models to approach or even surpass some large models in English-to-Chinese literary translation.

Compatibility-Aware Dynamic Fine-Tuning for Large Language Models

Building on the token-level stabilization method DFT, CADFT introduces a "sample-level compatibility" signal calculated from the model's own likelihood to re-weight supervised gradients. It further employs a delayed, low-frequency "compatibility-guided rewriting" to transform stubborn, difficult samples into learnable targets. This suppresses high-variance gradients without reward models or RL, enhancing SFT stability, generalization, and the quality of cold-start RL initialization.

ComplexConstraints and Beyond: Expert Rubrics for RLVR

This paper systematically demonstrates that "expert-written fine-grained scoring rubrics" serve as both more reliable evaluation tools for frontier LLMs and data-efficient RLVR reward signals. It proposes five design principles for constructing high-quality rubrics and introduces the ComplexConstraints dataset, where each prompt contains 10–40 atomic criteria. Empirical results show that performing RLVR with only ~1,000 expert samples improves the instruction-following capability of a 4B model by +15.5 pp and a 235B model by +12.2 pp. Furthermore, single-epoch agentic training successfully transfers to out-of-distribution (OOD) benchmarks that the model never encountered during training (BFCL +4.5 / τ²-Bench +7.4 / Toolathlon +6.8 pp).

ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training

ConsistRM proposes a consistency-aware self-training framework. By utilizing two modules—temporal consistency pseudo-labels (preference consistency merging online states and historical memory) and semantic consistency critique rewards (measuring semantic similarity of multiple generated critiques)—it improves the average performance of generative reward models by 1.5% across five benchmarks without human annotation, while significantly mitigating position bias.

Browse all 38 Alignment & RLHF papers →


🔒 LLM Safety (115)

STELA: A Linguistics-Aware LLM Watermarking via Syntactic Predictability

STELA uses "linguistic indeterminacy" \(\lambda(c_t)\) estimated from POS n-grams as a modulation signal for watermark strength. It weakens the watermark at positions with high syntactic constraints (preserving quality) and strengthens it at syntactically free positions (improving detectability). Similar to KGW, STELA remains publicly verifiable using only a POS tagger, without requiring access to model logits.

A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?

This paper provides the first systematic review of safety research for "Computer-Using Agents (CUA)," organizing 124 relevant papers into a four-dimensional framework of "Internal Threats × External Threats × Defense × Evaluation," and highlighting that the primary gaps in existing CUAs are UI grounding robustness and cross-platform adversarial evaluation.

Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL

Abstain-R1 proposes a clarification-aware RLVR reward to jointly optimize "explicit refusal" and "providing helpful clarifications (pointing out missing information) post-refusal" on unanswerable queries. This allows 3B models to approach or even surpass large models such as DeepSeek-R1 in refusal and clarification quality.

ACIArena: Toward Unified Evaluation for Agent Cascading Injection

This paper constructs the first unified evaluation framework for "Agent Cascading Injection (ACI)" attacks, ACIArena. It covers 6 mainstream multi-agent systems (MAS), 3 attack surfaces (Adversarial Input / Malicious Agent / Message Poison), and 3 attack goals (Hijacking / Disruption / Exfiltration) with 1356 test cases. It also proposes ACI-Sentinel, a minimalist yet effective defense that reduces Hijacking attack success rates from 92.78% to 8.06%.

Adaptive Text Anonymization: Learning Privacy-Utility Trade-offs via Prompt Optimization

This paper proposes an adaptive framework that uses evolutionary prompt optimization to discover task-specific anonymization instructions for LLMs. It outperforms hand-crafted strategies across multiple privacy-utility trade-off scenarios and is executable on open-source models.

ADVICE: Answer-Dependent Verbalized Confidence Estimation

This paper diagnoses the root cause of LLM verbalized overconfidence as "confidence hardly depends on the generated answer" through JSD and attribution analysis. It proposes ADVICE, a lightweight contrastive fine-tuning framework using answer pairs, which employs JSD/Margin/Sum losses to force the confidence distribution for correct answers to be significantly higher than for incorrect ones. This reduces Gemma2-9b's ECE on TriviaQA from 21.9% to 6.2% while maintaining task accuracy.
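
For readers unfamiliar with the ECE figure quoted above, the following is a minimal sketch of expected calibration error with equal-width confidence bins. The function name, the 10-bin default, and the toy inputs are assumptions; the paper's exact binning and evaluation protocol may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - avg_conf)
    return ece

# Toy case: the model always states 90% confidence but is right half the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, True, False, False])
# Toy case: fully confident and always right.
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

A confidence that does not move with the answer, as in the first toy case, produces a large gap between stated confidence and accuracy, which is the failure the summary above attributes to verbalized confidence.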

AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios

AgentCoMa constructs an agentic benchmark that forcibly combines commonsense selection with single-step mathematical operations. Evaluations across 61 LLMs reveal that while models typically solve each sub-problem when it is posed independently (80%), the average accuracy drops to 51% when the two are combined, exposing significant vulnerabilities in mixed-type compositional reasoning.

AgentMark: Utility-Preserving Behavioral Watermarking for Agents

AgentMark models the "next tool/subgoal selection" of an LLM agent as a time-varying discrete channel. By explicitly eliciting the behavioral distribution \(P_t\) and applying FDPSS-style distribution-preserving sampling, it embeds multi-bit IDs into planning decisions. Combined with RLNC encoding, the watermark can be recovered from residual logs even if the trace is cropped or steps are deleted. Across ALFWorld, ToolBench, and OASIS tasks, it maintains accuracy (SR difference from baseline <0.7 pp) while providing stable multi-bit capacity of 1.2-2.3 bps, and it is orthogonally stackable with content-level watermarks like SynthID-Text.

AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation

AGSC proposes an uncertainty quantification (UQ) framework for long-text generation that triggers adaptive granularity decomposition via NLI neutral probability (reducing inference time by 60%) and utilizes GMM soft clustering to capture latent semantic topics for topic-aware weighted aggregation, achieving SOTA factuality correlation on BIO and LongFact benchmarks.

APPSI-139: A Parallel Corpus of English Application Privacy Policy Summarization and Interpretation

APPSI-139 is the first parallel corpus of English application privacy policy summarization and interpretation finely annotated by legal experts (139 policies / 36,351 annotations / 15,692 rewrite pairs). The accompanying TCSI-pp-V2 framework utilizes a shared encoder with five alternately trained expert heads for five sub-tasks: "Importance / Risk / Sensitivity / Topic / Rewriting." Compared to TCSI-pp v1, the encoding time is reduced by 73%, and GPU memory usage decreases from 7.3GB to 2.7GB, with subjective readability surpassing GPT-4o and Llama3-70b.

Browse all 115 LLM Safety papers →


👻 Hallucination Detection (28)

Aligning with Your Own Voice: Self-Corrected Preference Learning for Hallucination Mitigation in LVLMs

The AVES-DPO framework is proposed. It utilizes consensus-based multi-model verification (YOLO, Grounding DINO, and Qwen3-VL) to detect fine-grained hallucinations (object, attribute, and relation) in responses generated by the LVLM itself. The target LVLM then performs self-correction and detail enrichment, creating preference pairs naturally within the model's "in-distribution." With only 5.2K samples, it outperforms SOTA methods relying on GPT-4V teachers across multiple hallucination benchmarks (achieving ~25× data efficiency).

Benchmarking Deflection and Hallucination in Large Vision-Language Models

This paper proposes VLM-DeflectionBench, a multimodal benchmark with 2775 samples that systematically evaluates the deflection vs. hallucination behaviors of Large Vision-Language Models (LVLMs) when evidence is insufficient or misleading across four evaluation scenarios (Parametric/Oracle/Realistic/Adversarial). Experiments covering 20 SOTA LVLMs reveal that nearly all models fail to reliably deflect under noisy evidence.

Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps

This paper proposes four audio-attention-based metrics (AudioRatio, AudioConsistency, AudioEntropy, TextEntropy) to train a lightweight logistic regression classifier for detecting hallucinations in SpeechLLMs during inference, achieving a PR-AUC improvement of up to +0.23 on in-domain data.

Dialectic-Med: Mitigating Diagnostic Hallucinations via Counterfactual Adversarial Multi-Agent Debate

This paper proposes Dialectic-Med, a multi-agent medical diagnostic framework inspired by Popper’s falsificationism. Through adversarial dialectical reasoning among a Proposer (diagnostic hypothesis), an Opponent (Visual Falsification Module actively retrieving contradictory evidence), and a Mediator (weighted consensus graph decision), it achieves SOTA on MIMIC-CXR-VQA, VQA-RAD, and PathVQA, improving explanation faithfulness by 12.5% and significantly mitigating diagnostic hallucinations.

Distorted or Fabricated? A Survey on Hallucination in Video LLMs

This paper provides the first systematic classification of hallucinations in Video Large Language Models (Vid-LLMs), proposing a mechanism-driven taxonomy comprising "Dynamic Distortion" (errors in spatiotemporal relations and reference consistency) and "Content Fabrication" (driven by statistical priors and audio-visual conflicts), while surveying evaluation benchmarks, mitigation strategies, and root causes.

Enhancing Hallucination Detection via Future Context

This paper proposes utilizing sampled "future context" (subsequent sentences) to enhance hallucination detection in black-box scenarios. By leveraging the "snowball effect"—where hallucinations tend to propagate once they occur—the method consistently improves performance across various sampling-based approaches such as SelfCheckGPT and SC.

FaithLens: Detecting and Explaining Faithfulness Hallucination

This paper proposes FaithLens, an 8B parameter faithfulness hallucination detection model. It undergoes cold-start SFT using high-quality data synthesis combined with three-dimensional filtering (label correctness, explanation quality, and data diversity), followed by further optimization via rule-based reinforcement learning (prediction correctness reward + explanation quality reward). It surpasses GPT-5.2 and o3 across 12 tasks while providing high-quality explanatory outputs.

FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification

FinGround is a three-stage "verify-then-ground" pipeline for financial document QA: (1) finance-aware hybrid retrieval; (2) decomposing answers into atomic claims and verifying them using a type-routed strategy across a six-category taxonomy (Numerical, Temporal, Entity Property, Comparative, Regulatory, Computational—where computational claims use formula reconstruction and arithmetic re-verification); (3) grounded rewriting of unsupported claims with paragraph/cell-level citations. By distilling GPT-4o into an 8B detector, it achieves a 91.4% F1 score with 18× acceleration, reducing the hallucination rate by 78% compared to GPT-4o+CoT.

Generating Effective CoT Traces for Mitigating Causal Hallucination

This paper proposes the Causal Hallucination Rate (CHR) metric to quantify the tendency of small LLMs to over-predict causal relationships in Event Causality Identification. Through systematic experiments, two key criteria for effective CoT data are identified (sufficient semantic explanation length + distribution alignment with the target model). A low-cost CoT data generation pipeline is designed, reducing the CHR of Qwen2.5-1.5B from 83.54% to 6.26% while improving average accuracy to 66.00%.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

This paper introduces HalluAudio, the first large-scale cross-domain (speech/ambient/music) benchmark for audio hallucination detection. It features 5,000+ human-verified QA pairs and systematic adversarial prompt designs. By evaluating mainstream LALMs using multi-dimensional metrics (Accuracy, Hallucination Rate, Yes-No Bias, Refusal Rate, and Error Types), the study reveals significant deficiencies in current models regarding acoustic anchoring, temporal reasoning, and music attribute understanding.

Browse all 28 Hallucination Detection papers →


📊 LLM Evaluation (97)

AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking

AgentEval models agent execution traces as "Evaluation DAGs," using GPT-4o as a judge to score nodes across five types and trace root causes through a greedy parent strategy. Combined with 21 failure categories and CI/CD integration, it achieved a 2.17× improvement in failure detection recall (0.41→0.89) over end-to-end evaluation on 450 production traces. It reached human consistency of \(\kappa=0.84\), root cause accuracy of 72% (approaching the human limit of 81%), and reduced the median root cause localization time from 4.2 hours to 22 minutes in a 4-month pilot.
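
The summary does not spell out the "greedy parent strategy", so the following is only one plausible reading, sketched under stated assumptions: starting from a failed node, repeatedly step to the lowest-scoring parent whose score is below a threshold, and stop when no parent is failing. The function name `trace_root_cause`, the `threshold` value, and the toy DAG are all hypothetical.

```python
from typing import Dict, List


def trace_root_cause(
    failed: str,
    parents: Dict[str, List[str]],
    scores: Dict[str, float],
    threshold: float = 0.5,  # hypothetical cut-off for a "failing" node
) -> str:
    """Walk upstream from a failed node, greedily following the worst failing parent."""
    node = failed
    visited = {node}
    while True:
        failing = [
            p for p in parents.get(node, [])
            if scores[p] < threshold and p not in visited
        ]
        if not failing:
            # No failing parent: treat the current node as the root cause.
            return node
        node = min(failing, key=lambda p: scores[p])
        visited.add(node)


# Toy trace: the plan step is fine, retrieval is poor, so the answer fails.
parents = {"answer": ["retrieve", "plan"], "retrieve": ["plan"], "plan": []}
scores = {"answer": 0.2, "retrieve": 0.3, "plan": 0.9}
root = trace_root_cause("answer", parents, scores)
```

In this toy trace the walk stops at `retrieve`, because its only parent scores above the threshold.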

Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

Addressing the reality of systematic expert disagreement in business idea evaluation, this work constructs the PBIG-DATA dataset containing 3,000 individual expert ratings. It empirically demonstrates that "personalized judges" (conditioned on a target reviewer's history) align better with expert behavior than "aggregate judges" (conditioned on mixed reviewer histories), challenging the common assumption of using pooled labels as the sole ground truth.

AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

This paper introduces AJ-Bench, the first benchmark to systematically evaluate the capabilities of Agent-as-a-Judge. It covers three domains—Search, Data Systems, and GUI—with a total of 155 tasks and 516 annotated trajectories. Experiments demonstrate that Agent-as-a-Judge improves the average \(F1\) score by approximately 13 percentage points compared to LLM-as-a-Judge.

Are They Lovers or Friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues

This paper proposes the SCRIPTS benchmark, containing 1.1K English and Korean movie dialogues, to evaluate the social relation reasoning capabilities of 9 LLMs through three-tier probabilistic labels (HIGHLY LIKELY / LESS LIKELY / UNLIKELY). The study finds that models achieve only 75-80% accuracy in English and 58-69% in Korean, and CoT or reasoning-based models provide almost no benefit for social reasoning.

arXiv2Table: Toward Realistic Benchmarking and Evaluation for LLM-Based Literature-Review Table Generation

The authors present the arXiv2Table benchmark (1,957 tables, 7,158 papers), which achieves a more realistic evaluation of LLM-based literature-review table generation by introducing distractor papers, schema-agnostic user demands, and a QA-based reference-free evaluation framework, alongside an iterative batch generation method.

Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models

This paper systematically reviews 134 papers on evidence-based text generation for LLMs. It proposes the first unified taxonomy (Attribution Mechanism × Citation Features × Task), analyzes 300 evaluation metrics categorized into seven dimensions and six methods, and provides a panoramic reference framework for this fragmented field.

Automated Creativity Evaluation of Language Models Across Open-Ended Tasks

This paper proposes an automated, task-decoupled, and reference-free framework to quantify LLM creativity. "Semantic Entropy" is employed to measure divergent creativity (novelty and diversity of ideas), while "Retrieval-based Multi-agent Judging" measures convergent creativity (whether the solution effectively addresses the problem). The study systematically uncovers the impact of model scale, temperature, and reasoning capabilities on creativity across three domains: problem-solving, scientific hypothesis generation, and creative writing.
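
As a rough illustration of the divergent-creativity measure, the sketch below computes Shannon entropy over semantic cluster labels. It assumes that generated ideas have already been grouped into clusters by some upstream step; the clustering itself, the function name `cluster_entropy`, and the example labels are hypothetical and not taken from the paper.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def cluster_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the distribution of ideas over semantic clusters."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Four sampled ideas that fall into two equally sized clusters.
spread = cluster_entropy(["reuse", "reuse", "barter", "barter"])
# Four sampled ideas that all land in the same cluster.
collapsed = cluster_entropy(["reuse", "reuse", "reuse", "reuse"])
```

Higher values mean the samples are spread over more distinct ideas; a model that keeps repeating one idea scores zero.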

BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers?

The authors developed a "BadScientist" pipeline: a generation agent that conducts no real experiments uses five "performative fraud" strategies to write seemingly rigorous but fundamentally unsound papers. These are then fed to a multi-model reviewer agent composed of o3 / o4-mini / GPT-4.1. Results show that the acceptance rate for fraudulent papers reaches up to 82%. Furthermore, reviewers often point out integrity issues in their text comments while still assigning acceptance scores (concern-acceptance conflict), and existing mitigation methods perform barely better than random guessing.

BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks

Drawing on mature quality control frameworks for multiple-choice questions (MCQs) from the field of education, this work constructs BenchMarker. This tool uses LLM-as-judge to audit 12 mainstream NLP MCQA benchmarks across three dimensions: "contamination + shortcuts + writing errors." The study finds that 47% of TruthfulQA questions can be found directly online, while 100% of HellaSwag questions violate multiple writing rules. It empirically demonstrates that these flaws significantly inflate or deflate LLM accuracy and even alter model rankings.

Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA

To be added after deep reading.

Browse all 97 LLM Evaluation papers →


⚡ LLM Efficiency (23)

Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference

The "number of activated experts" in MoE inference is abstracted as a global budget \(B\). Optimal Top-K allocation is performed across layers via dynamic programming (Alloc-L), followed by token-level redistribution using global Top-\((K \cdot T)\) selection (Alloc-T). This approach halves the activation budget of DeepSeek-V2-Lite while maintaining accuracy, achieving a 1.15× speedup in prefill and a 1.34× speedup in decode.
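
The layer-level step can be pictured as a knapsack-style dynamic program. The sketch below is a minimal illustration, not the paper's Alloc-L: it assumes a hypothetical table `utility[l][k-1]` that scores giving layer `l` exactly `k` active experts, and picks a per-layer `k` that maximizes total utility within the budget.

```python
from typing import List


def allocate_topk(utility: List[List[float]], budget: int) -> List[int]:
    """Choose k >= 1 per layer to maximize summed utility with sum(k) <= budget."""
    n_layers = len(utility)
    neg = float("-inf")
    # best[l][b]: best utility using the first l layers and exactly b experts.
    best = [[neg] * (budget + 1) for _ in range(n_layers + 1)]
    choice = [[0] * (budget + 1) for _ in range(n_layers + 1)]
    best[0][0] = 0.0
    for l in range(1, n_layers + 1):
        for b in range(budget + 1):
            for k in range(1, min(len(utility[l - 1]), b) + 1):
                prev = best[l - 1][b - k]
                if prev == neg:
                    continue
                cand = prev + utility[l - 1][k - 1]
                if cand > best[l][b]:
                    best[l][b] = cand
                    choice[l][b] = k
    b = max(range(budget + 1), key=lambda x: best[n_layers][x])
    if best[n_layers][b] == neg:
        raise ValueError("budget too small to give every layer one expert")
    # Backtrack the chosen k for each layer, last layer first.
    alloc = []
    for l in range(n_layers, 0, -1):
        k = choice[l][b]
        alloc.append(k)
        b -= k
    return alloc[::-1]


# Hypothetical utilities for 3 layers with up to 3 active experts each.
utility = [[0.5, 0.6, 0.65], [0.2, 0.7, 0.9], [0.4, 0.45, 0.5]]
allocation = allocate_topk(utility, budget=6)
```

With these made-up utilities the middle layer, which benefits most from extra experts, receives the largest share of the budget.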

Are Large Language Models Economically Viable for Industry Deployment?

The Edge-Eval framework is proposed to evaluate the full life cycle of LLMs on traditional T4 GPUs through five deployment metrics (Economic Break-even, Intelligence-Power Ratio, System Density, Cold Start Tax, and Quantization Fidelity). It reveals that small models (<2B) are comprehensively superior to 7B models in economic and ecological dimensions and identifies an anomalous phenomenon where QLoRA increases energy consumption by up to 7x despite reducing memory usage.

Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning

This paper proposes PTE (Prefill Token Equivalents), a hardware-aware efficiency metric for tool-integrated reasoning (TIR) that unifies the costs of internal reasoning and external tool usage. Through large-scale experiments, it reveals four inefficiency patterns in TIR: confirmatory tool use, tool mixing, lack of tool priors, and tool format collapse.

BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs

The authors propose BOSCH, a training-free mixture-of-SWA method at the attention-head level. It models the SWA head selection as a Large Neighborhood Search (LNS) problem and decomposes it into a three-stage optimization (Layer Importance Probing → Adaptive Rate Assignment → Grouped Head Selection). It systematically outperforms layer-level heuristics and six static head-level methods across four models and four ratio settings.

Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models

This paper proposes AHD (Anchor-based History-stable Decoding), a training-free, plug-and-play dynamic decoding strategy. By utilizing dynamic anchors to backtrack historical trajectories and identify cross-block stable tokens in diffusion LLMs, AHD achieves early unlocking. It reduces decoding steps by 80% on BBH while simultaneously improving performance by 3.67%.

CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling

CoMeT introduces a "global memory + FIFO temporary memory" dual-memory plug-in for existing LLMs. By processing inputs in chunks, it achieves constant memory and linear time complexity. Fine-tuned only on 32k context, it enables precise retrieval at any position within 1M tokens and proposes hierarchical pipeline parallelism to allow fine-tuning 128k context on 16×80GB GPUs.

CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit

This paper proposes CreditDecoding, a training-free parallel decoding acceleration method that enhances correct but low-confidence tokens by accumulating token-level historical evidence (trace credit), achieving up to a 5.48x speedup and a 0.48 accuracy improvement on LLaDA-8B-Instruct.

Lizard: An Efficient Linearization Framework for Large Language Models

Lizard replaces the softmax attention of pretrained Transformers with a hybrid subquadratic attention module (Gated Linear Attention for global compression + Anchor Window Attention for local precision + learnable gates replacing RoPE). Using only 0.04B tokens for distillation, it outperforms existing linearization methods by 9.4–24.5 points on 5-shot MMLU and achieves a 32% throughput increase via a tensor-core-friendly training algorithm.

MTRouter: Cost-Aware Multi-Turn LLM Routing with History-Model Joint Embeddings

MTRouter models the selection of "which LLM to invoke at each turn" within multi-turn agent tasks as a per-turn routing problem under cost constraints. By using history-model joint embeddings to predict the contribution of candidate models to the final task outcome, it improves task performance while significantly reducing total invocation costs on ScienceWorld and HLE.

Multi-Drafter Speculative Decoding with Alignment Feedback

This paper proposes MetaSD, a unified framework that integrates multiple heterogeneous drafters into speculative decoding. By modeling drafter selection as a Multi-Armed Bandit (MAB) problem and using Block Divergence as a reward signal, MetaSD dynamically selects the drafter most aligned with the target LLM. It consistently outperforms single-drafter methods in both black-box and white-box configurations.
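
To make the bandit framing concrete, here is a minimal UCB1 selector over drafters. The summary does not say which bandit algorithm MetaSD uses, so UCB1 is a stand-in chosen for illustration; the class name, the exploration constant `c`, and the fixed per-drafter rewards (standing in for a Block Divergence-based signal) are all assumptions.

```python
import math
from typing import List


class UCB1Selector:
    """Pick one arm (drafter) per round with the UCB1 rule."""

    def __init__(self, n_arms: int, c: float = 2.0) -> None:
        self.counts: List[int] = [0] * n_arms
        self.means: List[float] = [0.0] * n_arms
        self.c = c  # hypothetical exploration constant

    def select(self) -> int:
        # Try every arm once before using confidence bounds.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        ucb = [
            m + math.sqrt(self.c * math.log(total) / n)
            for m, n in zip(self.means, self.counts)
        ]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # Incremental running mean of observed rewards.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Made-up, deterministic alignment rewards for three drafters.
rewards = [0.2, 0.8, 0.5]
selector = UCB1Selector(n_arms=3)
for _ in range(500):
    arm = selector.select()
    selector.update(arm, rewards[arm])
counts = selector.counts
```

Over the simulated rounds the selector concentrates its pulls on the drafter with the highest reward while still occasionally revisiting the others.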

Browse all 23 LLM Efficiency papers →


📚 Pretraining (12)

Compact Example-Based Explanations for Language Models

This paper proposes Selection Relevance Score, a re-training-free metric to evaluate the quality of training sample subsets as example-based explanations. It demonstrates that the common "select highest influence" strategy is often inferior to random selection and further introduces a new strategy that balances influence and representativeness.

Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

This paper proposes Data Mixing Agent, the first model-based end-to-end domain re-weighting framework. By training a small agent using CQL reinforcement learning on extensive data mixing trajectories, it learns generalizable data mixing heuristics. It balances performance between source and target domains in mathematical reasoning continual pre-training and generalizes to unseen source domains, target models, and domain spaces.

Demystifying Data Organization for Enhanced LLM Training

This paper systematically investigates the impact of "sample appearance order" in LLM training. By reusing existing sample-level quality/difficulty scores, it proposes four data organization principles: boundary reinforcement, cyclic review, continuous curriculum, and local diversity. The proposed STR and SAW strategies consistently enhance performance in both pre-training and SFT.

Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

The authors use Hierarchical Probabilistic Context-Free Grammar (HPCFG) to construct a set of "contamination-free, bounded, and precisely samplable" formal languages as controlled testbeds. They propose the "Discriminative AUC Test" as a unified metric to systematically compare FT and ICL across 18 LLMs from 6 families on 6 languages. The study finds that FT consistently outperforms ICL in-distribution, but the two perform equally on out-of-distribution data; ICL shares a similar inductive bias with FT but exhibits significantly higher sensitivity to specific tokens.

FOREVER: Forgetting Curve-Inspired Memory Replay for Language Model Continual Learning

The authors realign the "spaced repetition" concept of the Ebbinghaus forgetting curve from "training steps" to "model time" (accumulated parameter update norm \(\Delta_t = \|\Theta_t - \Theta_{t-1}\|_2\)). Specifically, cumulative model time \(\tau_t\) determines when to replay, while the instability ratio \(r_t\) (current update intensity \(\mu_t\) vs. baseline \(\mu_0\)) adaptively controls how to replay (regularization strength). The method consistently outperforms SOTA across 3 CL benchmarks and 4 backbones (0.6B–13B), achieving OP +1.2% and BWT +0.9% over the strongest baseline VBM.
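
The "model time" idea can be illustrated with a small sketch that accumulates the per-step update norm \(\Delta_t\) and triggers a replay whenever the running total crosses the next mark. The fixed `interval` between marks is an assumption made for simplicity (the paper's spacing schedule is not given in this summary), the instability ratio \(r_t\) is left out, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def update_norm(prev: Sequence[float], curr: Sequence[float]) -> float:
    """L2 norm of the parameter change between two consecutive steps."""
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))


def replay_steps(param_history: List[Sequence[float]], interval: float) -> List[int]:
    """Return the training steps at which cumulative model time crosses a new mark."""
    tau = 0.0              # cumulative model time
    next_mark = interval   # hypothetical fixed spacing between replays
    steps = []
    for t in range(1, len(param_history)):
        tau += update_norm(param_history[t - 1], param_history[t])
        if tau >= next_mark:
            steps.append(t)
            while next_mark <= tau:
                next_mark += interval
    return steps


# Toy 2-parameter trajectory: a large update, a stall, a small update, a large update.
history = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0], [3.0, 5.0], [6.0, 9.0]]
steps = replay_steps(history, interval=5.0)
```

Steps where the parameters barely move add little model time, so replay is triggered by how much the model has changed rather than by how many steps have elapsed.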

Is a Document Educational or Just Wikipedia-Style? -- Pitfalls of Classifier-Based Quality Filtering

This paper discovers that Classifier-based Quality Filtering (CQF) mistakenly equates "Wikipedia-style writing" with "higher educational value." Simple rewriting allows low-quality web pages to bypass pre-training data filtering thresholds; approximately 7% of samples in FineWeb-Edu flip their filtering decisions as a result.

KoCo: Conditioning Language Model Pre-training on Knowledge Coordinates

This paper proposes Knowledge Coordinate (KoCo) conditioning for pre-training, which maps each document to a three-dimensional semantic coordinate (Source, Content, Stability). These coordinates are injected into pre-training as text prefixes, providing the model with explicit context-awareness. This approach improves performance across 10 downstream tasks, accelerates convergence by approximately 30%, and effectively mitigates hallucinations.

On the Proper Treatment of Units in Surprisal Theory

This paper points out that the choice of the "next unit" in surprisal theory has historically been implicitly determined by pre-trained language model tokenizers. It proposes a finite-state transduction framework that explicitly decouples model tokens, linguistic units, and experimental Regions of Interest (ROI), demonstrating on MECO eye-tracking data that different unit inventories fundamentally alter how surprisal predicts reading time.

SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization

This paper proposes the SAGE optimizer, which addresses the "embedding layer dilemma" where lightweight optimizers fail on embedding layers. By combining a Lion-style sign update direction with an adaptive damping scaling factor that costs \(O(d)\) memory, SAGE achieves new SOTA perplexity on Llama models (up to 1.3B) with significantly lower optimizer memory.

SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models

This paper proposes SCRIPT, a model-agnostic plug-and-play module that injects Hangul subcharacter (Jamo) compositional knowledge into the embedding layers of existing subword-level PLMs using a dual-channel strategy. It achieves consistent improvements across Korean NLU/NLG tasks without re-pretraining and enables the embedding space to better capture grammatical regularities and semantic variations.

Browse all 12 Pretraining papers →


✏️ Knowledge Editing (10)

Aligning Language Models with Real-time Knowledge Editing

This paper introduces CRAFT (a continuously updated Chinese financial knowledge editing dataset) and KEDAS (a knowledge editing alignment paradigm based on diverse edit augmentation and adaptive inference) to resolve the difficulty of balancing success rate, locality, and portability in real-time knowledge editing scenarios.

Can Factual Opinions Be Edited (Manipulated) in Large Language Models?

This paper points out that existing knowledge editing techniques can be used to manipulate the "documented stances of public figures" (factual opinions). To address this, the authors construct the FOE benchmark with evidence and find that current methods result in "surface-level opinion changes with contradictory evidence." They propose a two-stage Self-Generated Evidence-Aligned method, enabling edited models to provide self-consistent evidence for manipulated opinions without relying on explicit instructions.

CLaRE-ty Amid Chaos: Quantifying Representational Entanglement to Predict Ripple Effects in LLM Editing

CLARE is a lightweight representation-based method that quantifies the degree of entanglement between facts using forward activations from a single intermediate layer. It predicts ripple effects in model editing, achieving an average 62.2% improvement in Spearman correlation over gradient-based methods while running 2.74x faster and using 2.85x less memory.

EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing

This paper proposes EvoEdit, which achieves large-scale sequential knowledge editing by dynamically evolving a null-space projector. It efficiently injects new knowledge while maintaining existing knowledge, preserving SOTA performance at the 10K editing scale and running 3.5x faster than AlphaEdit.

FABLE: Fine-grained Fact Anchoring for Unstructured Model Editing

This paper identifies that existing unstructured model editing methods, while capable of holistic recall of edited text, fail to provide access to fine-grained facts. It proposes the FABLE framework, which uses a two-stage hierarchical strategy to anchor fine-grained facts in shallow layers and integrate holistic narratives in deep layers, and constructs the UnFine diagnostic benchmark for systematic evaluation.

HiEdit: Lifelong Model Editing with Hierarchical Reinforcement Learning

HiEdit utilizes hierarchical reinforcement learning to decompose "lifelong model editing" into two subtasks: high-level layer selection and low-level gradient update calculation. This allows the hypernetwork to adaptively modify only half of the layers based on specific knowledge, improving the strong baseline RLEdit by an average of 8.48%.

One Mask to Rule Them All: On Hidden Facts after Editing and How to Find Them

This paper finds that ROME and MEMIT do not truly overwrite old knowledge but instead suppress it through a shared overattention mechanism; a sparse binary mask can reverse most edits and reduce the success rate of new edits from 98% to 38%.

Representation Interventions Enable Lifelong Knowledge Memory Control in LLMs

This paper proposes RILKE, which transforms lifelong knowledge editing from "modifying model weights" to "applying low-rank interventions in the hidden representation space." Through robust training, query-adaptive routing, and shared subspace modules, RILKE maintains near-perfect editing success rates and strong generalization after 1,000 unstructured knowledge edits while significantly reducing storage overhead.

Spectral Characterization and Mitigation of Sequential Knowledge Editing Collapse

The paper explains why sequential knowledge editing causes LLM general ability collapse from the perspective of SVD spectral structure and proposes REVIVE. By filtering update components that interfere with the dominant singular subspace within the singular vector basis of the original weights, REVIVE enables editors like MEMIT, RECT, and AlphaEdit to maintain both editing success rates and general capabilities under 10,000 to 20,000 continuous edits.

The Model Agreed, But Didn't Learn: Diagnosing Surface Compliance in Large Language Models

The SA-MCQ diagnostic framework is proposed to reveal the "surface compliance" phenomenon in knowledge editing—where editors achieve high scores on standard benchmarks but fail to truly overwrite internal beliefs. Models revert to original parametric memory in discriminative self-assessment, and sequential editing accumulates representational residue, leading to cognitive instability.


💬 LLM (Other) (62)

A Study of LLMs' Preferences for Libraries and Programming Languages

This study presents the first systematic investigation into the preferences of 8 LLMs regarding libraries and programming languages during code generation. It reveals that LLMs exhibit a severe bias toward popular libraries like NumPy (45% unnecessary usage) and the Python language (chosen in 58% of high-performance tasks), and that natural language recommendations often diverge from actual code selection behavior.

Adam's Law: Textual Frequency Law on Large Language Models

This paper proposes the "Textual Frequency Law" (TFL), revealing that for identical semantics, utilizing higher-frequency textual expressions to prompt or fine-tune LLMs yields superior performance. It further introduces frequency distillation and curriculum training strategies to leverage this law.

AlphaContext: An Evolutionary Tree-based Psychometric Context Generator for Creativity Assessment

AlphaContext is proposed as an evolutionary tree-based psychometric context generator. Through four modules—HyperTree outline planning, MCTS sentence-by-sentence generation, MAP-Elites diversity optimization, and assessment-guided iterative refinement—it automatically generates high-quality long-text contexts for creativity assessment, outperforming baseline methods by an average of 8% across seven evaluation dimensions.

An Existence Proof for Neural Language Models That Can Explain Garden-Path Effects via Surprisal

By fine-tuning neural language models on garden-path sentences, this work demonstrates the existence of a neural LM capable of simultaneously explaining garden-path effects and natural reading times via surprisal, providing an existence proof for Surprisal Theory.

Automatic Combination of Sample Selection Strategies for Few-Shot Learning

This paper proposes the ACSESS method, which automatically identifies and combines complementary sample selection strategies through three mechanisms: forward selection, backward selection, and Datamodels. Validated across 23 strategies, 5 ICL models, 3 gradient-based few-shot learning methods, and 14 datasets (6 text, 8 image), the combined strategy consistently outperforms single strategies and ICL-specific baselines.

Big AI is Accelerating the Metacrisis: What Can We Do?

In this ACL 2026 position paper, Steven Bird argues that "Big AI"—industrialized LLM engineering driven by a few giants—is simultaneously accelerating three interconnected crises: the ecological crisis, the meaning crisis, and the language crisis. Given that ACL is the primary publisher of LLM research, it must shift from "individual compliance" to "collective action of a professional community." The author proposes seven specific reforms for ACL, including prioritizing public interest, resisting corporate capture, protecting critical NLP, and establishing an NLP policy track.

C-World: A Computer Use Agent Environment Creator

The authors formalize the "agent environment" as an Action / Task / Transition / Reward quadruple and implement it as C-World. It utilizes 5,571 real MCP tools, automated task synthesis, state controller perturbations, and dual-signal rewards for high-fidelity evaluation. Furthermore, it employs a "World Engine" to simulate tool responses without live APIs, enabling scalable training. Evaluation of 9 frontier LLMs reveals that "planning is generally strong while execution is generally weak." Fine-tuning with as few as 1,170 C-World trajectories outperforms a baseline trained on 119k samples.

Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future

The authors provide a systematic survey of AI-assisted peer review methods in the LLM era. They categorize "review generation" into four paradigms: fine-tuning / agent / RL / generation enhancement, classify "after-review" into rebuttal / meta-review / paper revision, and present a four-quadrant evaluation taxonomy (human / reference-based / LLM-based / aspect-oriented). Finally, they discuss the future across six directions: novelty, automatic evaluation, cross-domain, multimodality, and ethics.

CAST: Achieving Stable LLM-based Text Analysis for Data Analytics

The CAST framework is proposed to constrain the potential reasoning paths of LLMs through two mechanisms: Algorithmic Prompting and Thinking-before-Speaking. This significantly enhances inter-run stability for text summarization and labeling tasks without sacrificing output quality.

Characterizing the Expressivity of Local Attention in Transformers

The authors utilize Linear Temporal Logic (LTL) as a unified characterization tool to strictly prove the following equivalences: global-only Transformer \(\leftrightarrow \mathrm{LTL}[\mathrm{P}]\), \(k\)-local-only \(\leftrightarrow \mathrm{LTL}[\mathrm{Y}^{\leq k}]\), and hybrid global+local \(\leftrightarrow \mathrm{LTL}[\mathrm{P}, \mathrm{Y}^{\leq k}]\). Consequently, they demonstrate that local and global expressivities are incomparable, hybrid models are strictly more powerful, and 1-local is the most expressive within the local family. Theoretical predictions are empirically validated on synthetic regular languages and WikiText-2.

Browse all 62 LLM (Other) papers →


📖 NLP Understanding (34)

A Computational Method for Measuring "Open Codes" in Qualitative Analysis

This paper proposes a theory-based computational method to systematically evaluate human and AI performance in inductive qualitative coding through an LLM-enhanced code merging algorithm and four ground-truth-free metrics (Coverage, Overlap, Novelty, and Divergence).

Accurate and Efficient Statistical Testing for Word Semantic Breadth

This paper identifies that directly comparing the semantic breadth of two words using permutation tests in contextual embedding space severely inflates Type-I errors due to differences in mean directions. It proposes using Householder reflections to align mean directions before permutation, reducing Type-I errors by 32.5%, and provides a GPU batch implementation achieving a 23x speedup.
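
The alignment step relies on a standard fact: for unit vectors \(u\) and \(v\), the Householder reflection \(H = I - 2ww^\top / (w^\top w)\) with \(w = u - v\) maps \(u\) onto \(v\) and preserves norms. The sketch below applies that reflection in plain Python; the function names and toy mean vectors are hypothetical, and how the paper embeds this step in its permutation test is not shown here.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def householder_reflect(
    x: Sequence[float], u: Sequence[float], v: Sequence[float]
) -> List[float]:
    """Apply to x the Householder reflection that maps unit vector u onto unit vector v."""
    w = [a - b for a, b in zip(u, v)]
    ww = sum(a * a for a in w)
    if ww < 1e-24:
        # u and v already coincide, so the reflection is the identity.
        return list(x)
    scale = 2.0 * sum(a * b for a, b in zip(w, x)) / ww
    return [xi - scale * wi for xi, wi in zip(x, w)]


# Toy mean directions of two words' contextual embeddings.
u = normalize([3.0, 4.0, 0.0])
v = normalize([0.0, 0.0, 2.0])
mapped_mean = householder_reflect(u, u, v)              # should land on v
rotated_point = householder_reflect([1.0, 2.0, 3.0], u, v)
```

Reflecting every embedding of one word this way aligns its mean direction with the other word's while leaving the spread around the mean unchanged.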

AdapTime: Enabling Adaptive Temporal Reasoning in Large Language Models

This paper proposes AdapTime, which abstracts "temporal reasoning" into three reusable atomic actions: reformulate, rewrite, and review. Guided by an LLM Planner, the system adaptively decides which steps to execute and in what order based on the question and context. Without external tools, manual rules, or fine-tuning, it significantly improves LLM performance on temporal QA, pushing TimeQA-Easy to 85.4 EM on DeepSeek-V3.

Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations

This paper extends the LiTEx reasoning taxonomy from "explanation variation under label agreement" to "label disagreement" scenarios. It finds that annotators may choose different labels yet rely on similar reasoning, and that agreement on reasoning categories reflects the semantic similarity of explanations better than label agreement does.

ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering

ASTRA adaptively reconstructs complex tables into semantic trees and employs a dual-mode reasoning approach consisting of text tree navigation and symbolic code execution. It achieves accuracies of 91.6%, 81.9%, and 90.1% on AIT-QA, SSTQA, and HiTab, respectively, outperforming strong LLMs and existing table structuralization methods.

Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering

This paper leverages Rhetorical Structure Theory (RST) to parse the discourse organization of long documents, constructing a sentence-level hierarchical tree with intermediate nodes enhanced by LLM summarization. By performing structure-aware multi-granularity retrieval on this tree, the proposed method consistently outperforms fixed-size chunking and RAPTOR-style semantic clustering across four benchmarks: QASPER, QuALITY, NarrativeQA, and MultiFieldQA-zh.

BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation

BoundRL reframes structured text segmentation as a boundary generation task—generating only the starting tokens for each segment rather than the full text. This reduces output tokens by 90% and eliminates hallucination risks. Combined with a dual-objective reward function and a selective perturbation strategy for RLVR training, it enables a 1.7B small model to outperform the few-shot performance of Claude-4 Sonnet.

Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?

This paper constructs the ReCo reading comprehension cognitive complexity dataset and systematically evaluates whether 8 LLMs can automatically determine the required evidence scope and transformation levels for items. Results indicate that strong models approach but remain significantly lower than experts, particularly in identifying complete evidence sets and fine-grained word-order transformations.

Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding

This paper proposes an automated method to augment existing commonsense knowledge bases with negation, constructing a negation commonsense corpus of over 2 million triplets (\(\neg \text{Atomic}\) and \(\neg \text{Anion}\)), and demonstrates that pretraining on this corpus enhances the negation understanding capabilities of LLMs.

Creating ConLangs to Probe the Metalinguistic Grammatical Knowledge of LLMs

This paper introduces IASC (Interactive Agentic System for ConLangs), a modular constructed language generation system. By requiring LLMs to execute morphosyntactic transformations based on linguistic specifications, the study probes their metalinguistic knowledge. Findings reveal that LLMs handle common linguistic typological patterns significantly better than rare ones, and performance varies drastically across different models.

Browse all 34 NLP Understanding papers →


✍️ Text Generation (17)

Adaptive Planning for Multi-Attribute Controllable Summarization with Monte Carlo Tree Search

This paper proposes PACO, which reformulates "multi-attribute controllable summarization" as a planning problem to find an "attribute control sequence." Using a customized Monte Carlo Tree Search (where nodes are full summaries and actions are single-attribute adjustments), it identifies the optimal adjustment path during the prompting stage without any attribute-specific training. With Llama-3.2-1B, it achieves controllability comparable to the Llama-3.3-70B baseline, while Llama-3.3-70B + PACO surpasses all existing methods.

Are Emotion and Rhetoric Neurons in LLM? Neuron Recognition and Adaptive Masking for Emotion-Rhetoric Prediction Steering

This paper systematically investigates the representation mechanisms and intrinsic correlations of emotion and rhetorical neurons in LLMs. By proposing a neuron recognition framework combined with multi-dimensional screening and an adaptive masking verification method, it achieves directional induction of emotion/rhetoric prediction and utilizes rhetorical neurons to assist emotion recognition.

Can You Make It Sound Like You? Post-Editing LLM-Generated Text for Personal Style

The authors conducted a pre-registered online study with 81 participants who used GPT-o4-mini to draft and then manually post-edit style-sensitive texts such as wedding vows and apology letters. The findings reveal that while post-editing significantly moves the text toward the user's personal style and away from the LLM's style, the edited texts still systematically retain more "AI-like" traces than independent writing—a residue that participants themselves fail to perceive.

Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety

The authors utilized 2,580 stories generated by GPT-4o / Llama-3.3-70B corresponding to the UFLI K–2 English reading curriculum to perform four SFT designs (baseline, Good Stories, Rewarded SFT, and simulated children's pronunciation errors) on three 8B models (Llama 3 / Granite 3.3 / Apertus). The results demonstrate that compact models + appropriate SFT strategies can outperform zero-shot GPT-4o and Llama-3.3-70B on key K-2 metrics such as Spache readability, syntactic complexity, and toxicity. Among these, Rewarded SFT proved most stable and nearly hallucination-free.

ConlangCrafter: Constructing Languages with a Multi-Hop LLM Pipeline

This paper introduces ConlangCrafter, an LLM-based multi-hop pipeline that decomposes constructed language (conlang) design into modular stages of phonology, grammar, and lexicon. It ensures typological diversity through randomness injection and internal consistency via self-refinement loops, while proposing an automated evaluation framework encompassing typological diversity analysis and translation consistency.

Difficulty-Controllable Cloze Question Distractor Generation

This paper proposes DCDG, which enables easy/hard difficulty control for cloze distractor generation via dual-path data augmentation, QA ensemble difficulty clustering, and multi-task seq2seq training, significantly outperforming GPT-4o in both automatic and human evaluations.

EDUMATH: Generating Standards-aligned Educational Math Word Problems

The authors systematize the task of "generating math word problems (MWP) aligned with K-12 math curriculum standards," collecting 11,000+ STEM MWP training data points annotated by real US teachers. Through an SFT + KTO + ModernBERT filtering pipeline, they trained two open-source SOTA generators, EDUMATH-12B/30B. They conducted the first RCT on actual 3rd-5th grade students, finding that while student accuracy was comparable between LLM-generated and human-written problems, students showed an almost unanimous preference for customized LLM problems.

FACTS: Table Summarization via Offline Template Generation with Agentic Workflows

This paper proposes FACTS (Fast, Accurate, and Privacy-Compliant Table Summarization), which automatically generates reusable offline templates (SQL queries + Jinja2 templates) through a three-stage agentic workflow. It achieves rapid, accurate, and privacy-compliant query-focused table summarization, outperforming baselines across the FeTaQA, QTSumm, and QFMTS benchmarks.

Frankentext: Stitching Random Text Fragments into Long-Form Narratives

This paper proposes the Frankentext paradigm, which enables LLMs to stitch random human text fragments into coherent long-form narratives under extreme constraints (90% of text copied verbatim from human writing). This reveals the severe failure of current AI text detectors in mixed-authorship scenarios (72% of Frankentext is misclassified as human writing).

In-depth Research Impact Summarization through Fine-Grained Temporal Citation Analysis

This paper proposes the "Scientific Impact Summarization" task: first identifying fine-grained intents that truly reveal impact from the citation contexts of a paper, and then generating an impact narrative that evolves over time. This approach better illustrates how a paper is adopted, criticized, and transformed by subsequent work compared to simple citation counts.

Browse all 17 Text Generation papers →


🗣️ Dialogue Systems (26)

APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI

The proposed system constructs long-term conversational memory using a trio of "domain-agnostic ontology-supported property graphs + append-only event storage + ReAct multi-tool retrieval agents." By never overwriting during construction and resolving temporal conflicts only at retrieval, it achieves 88.88% on LOCOMO (3.5% higher than MIRIX) and 86.2% on LongMemEval (13.7% higher than the strongest RAG baseline).
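The "never overwrite, resolve at retrieval" idea can be illustrated with a minimal Python sketch. Everything here is a hypothetical simplification: the `Event` fields, the `AppendOnlyMemory` class, and the "latest timestamp wins" rule are assumptions made for illustration, not the system's actual property-graph schema or its agent-based retrieval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One immutable memory record (hypothetical schema)."""
    timestamp: int
    subject: str
    attribute: str
    value: str


class AppendOnlyMemory:
    """Stores every event; conflicts are resolved only when reading."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, event: Event) -> None:
        # Construction never overwrites: old facts stay in the log.
        self._events.append(event)

    def lookup(self, subject: str, attribute: str,
               as_of: Optional[int] = None) -> Optional[str]:
        # Assumed resolution rule: the most recent matching event
        # at or before `as_of` wins.
        candidates = [
            e for e in self._events
            if e.subject == subject
            and e.attribute == attribute
            and (as_of is None or e.timestamp <= as_of)
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda e: e.timestamp).value


memory = AppendOnlyMemory()
memory.append(Event(1, "user", "city", "Paris"))
memory.append(Event(5, "user", "city", "Berlin"))
current_city = memory.lookup("user", "city")
earlier_city = memory.lookup("user", "city", as_of=3)
```

Because both events remain in the log, the same store can answer "where does the user live now" and "where did the user live at time 3" without any destructive update.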

Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review

This work redefines academic author response (rebuttal) generation as an "Author-in-the-Loop" task, introducing the Re3Align dataset (3.4K papers, 440K sentence-level edit annotations, 15K review-response-revision triplets), the REspGen controllable generation framework, and the REspEval evaluation suite with 20+ metrics. The approach systematically validates the effects of author input, controllability, and evaluation-guided refinement across 5 state-of-the-art LLMs.

Codebook-Injected Dialogue Segmentation for Multi-Utterance Constructs Annotation: LLM-Assisted and Gold-Label-Free Evaluation

The paper reformulates dialogue act annotation as a two-step "segment-then-label" problem. It proposes two approaches: codebook-injected LLM segmentation (System 1) and Dial-Start with DA-aware retrieval augmentation (System 2). It further introduces three categories of evaluation metrics that do not require gold boundaries (within-segment consistency, adjacent segment divergence, and human-AI distribution alignment). Experiments on TalkMoves and CLASS-annotated educational dialogues demonstrate that DA-aware prompting enables LLMs to produce more homogeneous segments, though coherence-based baselines and LLMs excel in different evaluation dimensions with no single optimal solution.

CoDial: Interpretable Task-Oriented Dialogue Systems Through Dialogue Flow Alignment

This paper proposes CoDial, a framework that converts predefined dialogue flows (task schemas) into structured heterogeneous graphs and automatically generates LLM guardrail code (such as Colang). It achieves interpretable and controllable task-oriented dialogue policies during inference, reaching SOTA on the STAR benchmark without requiring training data.

Cognitive Policy-Driven LLM for Diagnosis and Intervention of Cognitive Distortions in Emotional Support Conversation

This paper proposes the CoPoLLM framework and constructs CogBiasESC, the first Emotional Support Conversation (ESC) dataset with cognitive distortion labels. By combining a Cognitive Policy Reinforcement Learning (CPRL) engine and Dual-Stream Condition Optimization (DSCO), the LLM can diagnose 8 types of cognitive distortions and generate policy-aware intervention responses, consistently outperforming 15 SOTA baselines.

Context-Agent: Dynamic Discourse Trees for Non-Linear Dialogue

The authors propose Context-Agent, which models multi-turn dialogue history as a "forest of discourse trees" (where each tree represents an independent topic and each branch represents an instruction refinement/fork). Nodes are organized by navigational intent rather than semantic similarity. The model is accompanied by the NTM benchmark for evaluating non-linear long-range dialogues and demonstrates improved task completion rates and reduced token consumption across various LLMs.

Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

The DiaFORGE framework is proposed, featuring a disambiguation-centric synthetic data generation pipeline, reasoning-chain finetuning, and a dynamic evaluation system. This allows open-source LLMs to achieve a tool-calling success rate 27 percentage points higher than GPT-4o and 49 percentage points higher than Claude-3.5-Sonnet when facing near-duplicate enterprise APIs.

Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation

This paper proposes DRCR, the first framework to introduce context rewriting into multi-party dialogue generation. It utilizes dual feedback signals—discourse coherence and response quality—to construct preference data, enabling the rewriter and responder to mutually enhance each other through iterative dynamic self-evolution.

Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

The authors define "Inquisitive Dialogue"—where an AI actively questions an uncooperative interlocutor, exemplified by U.S. Supreme Court justices questioning attorneys—and propose a Dual Hierarchical RL framework. This framework consists of an Appraisal Agent that scores attorney responses in real-time across 9 appraisal categories, and a Hierarchical Dialogue Agent that performs DDQN in a three-layer (act/subtype/utterance) Poincaré action space. Combined with triple rewards (goal-relevance, novelty, and conciseness) and a conservative regularization term, the method improves Probing Effectiveness (PES) from a baseline of 4.22 to 4.47 on the Oyez Supreme Court dataset, achieving the highest Coverage and MR in multi-turn scenarios.

ETHICMIND: A Risk-Aware Framework for Ethical-Emotional Alignment in Multi-Turn Dialogue

ETHICMIND proposes an inference-time risk-aware alignment framework that jointly analyzes ethical risks and user emotions in each turn of a multi-turn dialogue. It plans high-level response strategies to generate replies that balance ethical guidance and emotional resonance, achieving consistent alignment in high-risk and morally ambiguous scenarios without additional training.

Browse all 26 Dialogue Systems papers →


🌐 Multilingual & Translation (64)

A Multilingual Dataset and Empirical Validation for the Mutual Reinforcement Effect in Information Extraction

This paper constructs the first multilingual MRE Mix dataset (MMM, 21 subsets covering English, Chinese, and Japanese) and systematically validates that the Mutual Reinforcement Effect (MRE) between word-level and text-level information extraction tasks is cross-linguistically universal through large-scale ablation experiments.

Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs

Alexandria constructs a multi-turn Dialectal Arabic-English parallel dataset covering 13 Arabic countries, 11 social impact domains, and 107K turns. Through a community-driven human translation and revision process, it provides unprecedented fine-grained training and evaluation resources for Dialectal Arabic machine translation and systematically benchmarks 24 LLMs.

BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation

BabelDOC is proposed as a layout-preserving PDF translation system based on an Intermediate Representation (IR) that decouples visual layout from semantic content. This allows NLP operations—such as LLM translation, terminology extraction, cross-page context awareness, and formula masking—to be performed at the semantic layer before being re-anchored to the original layout via an adaptive typesetting engine. On a 200-page benchmark, it outperforms PDFMathTranslate and DeepL Document Translation in BIoU, layout fidelity, and terminology consistency.

Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation

The authors construct MENT, a meta-evaluation dataset for non-literal translation (7,530 human annotations), revealing the unreliability of traditional metrics and LLM-as-Judge in non-literal scenarios. They propose the RATE agentic evaluation framework, which improves correlation with human judgment by over 3.2 points through a reflective core agent that dynamically invokes functional sub-agents.

BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources

This is the first unified survey specifically targeting Indic NLP resources, covering 200+ datasets, 50+ benchmarks, and 100+ models/tools. Organized by 17 task categories (from core language processing to socio-cultural tasks), it systematically analyzes persistent challenges such as uneven linguistic coverage, fragmented annotation, and inconsistent evaluation.

CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning

This paper proposes CLewR (Curriculum Learning with Restarts), a strategy that sorts data from easy to hard during preference optimization and restarts the curriculum every epoch. This effectively mitigates catastrophic forgetting and consistently improves machine translation performance across multiple model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization algorithms (DPO, CPO, ARPO).
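The scheduling idea (order data from easy to hard, then restart that ordering every epoch) can be sketched in a few lines of Python. The `difficulty` callable and the toy scores below are hypothetical; the summary does not say how difficulty is measured, so this sketch only shows the ordering and restart behaviour, not the preference-optimization step itself.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def curriculum_with_restarts(
    examples: Sequence[T],
    difficulty: Callable[[T], float],
    num_epochs: int,
) -> Iterator[tuple[int, T]]:
    """Yield (epoch, example) pairs, easy to hard, restarting each epoch."""
    # Sort once by the (assumed) difficulty score, lowest first.
    ordered = sorted(examples, key=difficulty)
    for epoch in range(num_epochs):
        # Restart: every epoch begins again from the easiest example.
        for example in ordered:
            yield epoch, example


# Toy preference pairs tagged with a made-up difficulty score.
pairs = [("hard", 0.9), ("easy", 0.1), ("medium", 0.5)]
schedule = list(curriculum_with_restarts(pairs, lambda p: p[1], num_epochs=2))
```

In a real training loop, each yielded example would be batched and passed to the chosen preference-optimization loss; the generator only controls presentation order.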

Cross-Cultural Transfer of Emoji Semantics and Sentiment in Financial Social Media

By systematically comparing emoji frequency, semantics, and sentiment polarity across 100 million financial microblogs in 4 languages, 2 platforms, and 2 asset classes, this study finds that while emoji frequency varies significantly across languages/platforms, their semantics and polarity remain highly stable. Consequently, in zero-shot sentiment transfer, incorporating emojis into text consistently reduces the cross-platform transfer gap from as high as 21% to nearly 0%.

DFKI-MLT at SemEval-2026 TASK 7: Steering Multilingual Models Towards Cultural Knowledge

This SemEval system paper utilizes the FLORES parallel corpus to extract language directions and injects language steering vectors into the residual stream of multilingual LLMs during inference. The system achieved an official MCQ accuracy of 86.96% (7th out of 17 teams), though post-hoc analysis indicates that gains are highly sensitive to layers, prompts, models, and locales.

Digitizing Nepal's Written Heritage: A Comprehensive HTR Pipeline for Old Nepali Manuscripts

This is the first end-to-end Handwritten Text Recognition (HTR) pipeline for Old Nepali. By employing a "Synthetic Devanagari → Printed Nagari → Old Nepali Manuscripts" three-stage transfer learning curriculum, 8× data augmentation with 20 techniques, byte-level BPE, and a script-aware decoder, the CER is reduced from a fine-tuned TrOCR baseline of 9.6% to 4.9%. The code, models, and a Streamlit web application are open-sourced.

Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion

TriMix decomposes Low-Resource Language (LRL) adaptation into three logit benefit vectors: "language capability + task capability + scaling dividends." It only requires continual pre-training (CPT) on a small model. At inference time, weights are dynamically determined via perplexity. It consistently outperforms single-model baselines and Proxy Tuning across 4 model families and 8 LRLs. A core empirical discovery is that "the weight of the small CPT model should be higher than that of the large instruction model," directly challenging the "large-model-dominant" assumption in Proxy Tuning.

Browse all 64 Multilingual & Translation papers →


🔍 Information Retrieval & RAG (73)

A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval

Through a carefully designed financial document diagnostic benchmark (single-digit perturbation + text masking), this study empirically proves that "aggregating VLM patch tokens into a single vector" causes vast semantic differences (e.g., $1.2M vs $7.2M) to collapse into nearly identical vectors with cosine similarity above 0.99. The root cause is "global texture dominance," which various mitigation strategies and retrieval-tuned embeddings fail to resolve.
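A toy calculation shows why pooling can hide a local change. The sketch below assumes mean pooling as the aggregation step and uses made-up 2-dimensional "patch embeddings": 99 identical background patches plus one patch that differs completely between two documents. None of these numbers come from the paper; they only illustrate the arithmetic.

```python
import math


def mean_pool(patches: list[list[float]]) -> list[float]:
    """Average patch vectors into a single document vector."""
    n = len(patches)
    dim = len(patches[0])
    return [sum(p[i] for p in patches) / n for i in range(dim)]


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# 99 shared "texture" patches; only the last patch differs (hypothetical data).
background = [[1.0, 0.0]] * 99
doc_a = background + [[0.0, 1.0]]
doc_b = background + [[0.0, -1.0]]

pooled_similarity = cosine(mean_pool(doc_a), mean_pool(doc_b))
patch_similarity = cosine(doc_a[-1], doc_b[-1])
```

Here the differing patches point in opposite directions, yet the pooled vectors stay almost parallel because the shared background dominates the average.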

A Survey of Reasoning-Intensive Retrieval: Progress and Challenges

This paper systematically organizes the emerging direction of "Reasoning-Intensive Retrieval (RIR)." It provides the first comprehensive three-part survey—benchmarks, methods, and challenges—following the pipeline of query/index/retriever/reranker/iteration, and points out that current evaluations rely excessively on traditional IR metrics like nDCG.

Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning

ConvAgent is proposed to train conversational search agents to alternate between search and reasoning in multi-turn interactions by decomposing RL rewards into three complementary components: outcome reward, information gain reward, and mixed-initiative behavior reward.

All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG

The study systematically reveals that multilingual RAG systems exhibit severe language bias (preference for English and query languages) during the reranking stage. It proposes the LAURA framework, which aligns the reranker via supervised signals driven by downstream generation quality, effectively mitigating bias and improving generation performance.

An Iterative Utility Judgment Framework Inspired by Philosophical Relevance via LLMs

Inspired by Schutz's philosophical theory of relevance, this paper proposes ITEM, an iterative utility judgment framework. By enabling dynamic interaction and mutual enhancement among three RAG components (relevance ranking, utility judgment, and answer generation), ITEM outperforms baselines in retrieval, utility judgment, and QA tasks.

AuthorityBench: Benchmarking LLM Authority Perception for Reliable Retrieval-Augmented Generation

AuthorityBench constructs the first LLM "authority perception" benchmark using 10K web domains (PageRank ground truth) + 22K entities (Wikipedia cross-lingual sitelink ground truth) + 120 RAG questions. The study finds that ListJudge / PairJudge + PointScore yields the most accurate outputs, adding web text can degrade performance, and utilizing authority signals for RAG filtering improves answer accuracy by up to 14 percentage points.

Bayesian Active Learning with Gaussian Processes Guided by LLM Relevance Scoring

BAGEL is proposed as a Bayesian active learning framework based on Gaussian Processes (GP). By using an exploration-exploitation balance strategy to propagate sparse LLM relevance signals across the global embedding space under a limited LLM budget, it achieves passage retrieval that significantly outperforms traditional LLM reranking methods.

Benchmarking and Enabling Efficient Chinese Medical Retrieval via Asymmetric Encoders

This paper proposes CMedTEB (Chinese Medical Text Embedding Benchmark) and CARE (Asymmetric Retrieval Framework). The former establishes a high-quality Chinese medical retrieval/reranking/STS benchmark through multi-LLM voting and expert validation. The latter utilizes an asymmetric architecture with a lightweight BERT for query encoding and a large LLM for document encoding, achieving LLM-level retrieval precision with BERT-level online latency through a two-stage progressive alignment strategy.

Beyond Black-Box Interventions: Latent Probing for Faithful Retrieval-Augmented Generation

ProbeRAG is proposed to address RAG faithfulness through the model's internal mechanisms by discovering the linear separability of conflicting/aligned knowledge in the LLM's latent space. It employs a three-stage framework: fine-grained knowledge pruning, latent space conflict probing, and conflict-aware attention.

Beyond Chunks and Graphs: Retrieval-Augmented Generation through Triplet-Driven Thinking

T2RAG replaces the minimum retrieval unit of RAG from "text chunks/KG nodes" with atomic triplets. Off-line, the corpus is extracted into a collection of triplet propositions for indexing. On-line, the LLM decomposes the question into searchable triplets with ? placeholders, iteratively retrieving evidence from the triplet library to fill in the blanks until all placeholders are resolved to generate the final answer. This achieves an average improvement of up to 11% across six datasets while reducing retrieval costs by up to 45%.
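The placeholder-filling loop can be sketched as follows. This is a deliberately simplified illustration: it swaps in exact string matching over an in-memory list for the retrieval step, and the `?name` placeholder syntax, the `resolve` function, and the example facts are all hypothetical.

```python
Triplet = tuple[str, str, str]


def is_placeholder(term: str) -> bool:
    return term.startswith("?")


def resolve(query: list[Triplet], store: list[Triplet],
            max_rounds: int = 5) -> dict[str, str]:
    """Iteratively bind ?placeholders using facts from the triplet store."""
    bindings: dict[str, str] = {}
    for _ in range(max_rounds):
        progressed = False
        for q in query:
            # Substitute placeholders that earlier rounds already resolved.
            bound = tuple(bindings.get(t, t) for t in q)
            if not any(is_placeholder(t) for t in bound):
                continue
            for fact in store:
                # Exact match stands in for retrieval (an assumption).
                if all(is_placeholder(a) or a == b for a, b in zip(bound, fact)):
                    for a, b in zip(bound, fact):
                        if is_placeholder(a):
                            bindings[a] = b
                    progressed = True
                    break
        if not progressed:
            break
    return bindings


store = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
]
query = [("Inception", "directed_by", "?director"), ("?director", "born_in", "?city")]
answer = resolve(query, store)
```

The second query triplet only becomes searchable after `?director` is bound, which is the multi-hop behaviour the iterative loop is meant to capture.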

Browse all 73 Information Retrieval & RAG papers →


💻 Code Intelligence (50)

Across Programming Language Silos: A Study on Cross-Lingual Retrieval-Augmented Code Generation

This paper presents the first systematic study of cross-programming-language Retrieval-Augmented Code Generation (RACG). By constructing a 14K-instance dataset across 13 languages, the study reveals the asymmetry of cross-lingual knowledge transfer and its relationship with language affinity and pre-training diversity.

AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor

This paper constructs AutoMonitor-Bench, the first systematic benchmark for evaluating whether LLM-based monitors can reliably identify model misbehavior (3,010 paired samples covering safety violations, sycophancy/bias, and specification gaming). Evaluation across 22 open-source and closed-source monitoring models reveals a systematic trade-off between Miss Rate (MR) and False Alarm Rate (FAR). Furthermore, SFT experiments on 153k samples demonstrate that fine-tuning on easily constructed misbehavior fails to generalize to implicit specification gaming.

Benchmarking Testing in Automated Theorem Proving

Drawing inspiration from the concept of "integration testing" in software engineering, the semantic correctness of a generated theorem is determined by whether "all successor theorems depending on it still compile." This work constructs T2, a Lean 4 benchmark with 2,206 problems, revealing a significant gap where mainstream LLMs achieve an 80%+ compilation rate but a semantic accuracy of only ~39%.

Bootstrapping Code Translation with Weighted Multilanguage Exploration

BootTrans proposes a bootstrapping multilingual code translation method that leverages test cases from a single hub language (Python) as cross-language verification oracles. Combined with a dual-pool architecture for experience collection to expand training data and a language-aware weighting mechanism to prioritize difficult translation directions, it significantly outperforms baselines on HumanEval-X and TransCoder-Test.

Can LLMs Compress (and Decompress)? Evaluating Code Understanding and Execution via Invertibility

This paper proposes RoundTripCodeEval (RTCE): a code reasoning benchmark using 4 lossless compression algorithms (LZW/AE/RLE/Huffman) to construct 250 inputs × 4 subtasks = 1000 strict round-trip (encode→decode must restore bit-exact data) tasks. Results show that even QwQ-32B achieves 0% EM on Huffman encoding, a failure that cannot be addressed by SFT or self-reflection.
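To make the round-trip requirement concrete, here is a minimal Python sketch using run-length encoding (RLE), one of the four algorithms named above. The function names and the list-of-pairs encoding are illustrative assumptions; the summary does not specify the benchmark's actual input or output format.

```python
def rle_encode(data: str) -> list[tuple[str, int]]:
    """Collapse consecutive repeats into (character, count) runs."""
    runs: list[tuple[str, int]] = []
    for ch in data:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def rle_decode(runs: list[tuple[str, int]]) -> str:
    """Expand (character, count) runs back into the original string."""
    return "".join(ch * count for ch, count in runs)


def round_trip_ok(data: str) -> bool:
    # Strict check: decoding the encoding must restore the input exactly.
    return rle_decode(rle_encode(data)) == data


encoded = rle_encode("aaabcc")
restored = rle_decode(encoded)
```

A model evaluated this way must get both directions right; a single wrong count in the encoding makes the decoded result differ from the input and fails the exact-match check.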

ChatHLS: Towards Systematic Design Automation and Optimization for High-Level Synthesis

ChatHLS proposes a multi-agent HLS design framework. Through two core components—HLSTuner (QoR-aware reasoning for optimization pragma selection) and HLSFixer (a debugging framework enhanced by hierarchical feedback)—combined with a self-evolving error case expansion mechanism (VODA), it significantly outperforms baselines in both HLS-C generation success rates and hardware performance optimization.

ChipSeek: Optimizing Verilog Generation via EDA-Integrated Reinforcement Learning

ChipSeek proposes a hierarchical reward RL framework that integrates the EDA toolchain directly into the training loop. Through Curriculum-driven Dynamic Policy Optimization (CDPO), it enables LLMs to generate RTL code that meets both functional correctness and PPA (Power-Performance-Area) optimization objectives, achieving SOTA on standard benchmarks.

CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents

CodeDistiller automatically distills scientific GitHub repositories into runnable and debugged example code libraries, enabling Code-RAG scientific discovery agents to utilize real-world domain tools. On 250 materials science repositories, the best model achieved a human-verified functional correctness rate of 74.1%, and experts more often preferred its results on downstream discovery tasks.

CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

This paper proposes CodeRL+, which integrates execution semantics alignment into the RLVR training pipeline. By enabling models to infer variable-level execution trajectories, it bridges the gap between code textual representation and execution semantics. CodeRL+ achieves an average 4.6% improvement in pass@1 for code generation and improvements of 15.5% and 4.4% on code reasoning and test output generation benchmarks, respectively.

CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases

This paper proposes CodeWiki, an open-source framework based on hierarchical decomposition and recursive multi-agent processing for automatic repository-level code documentation generation. It also constructs the CodeWikiBench benchmark, where it surpasses the closed-source system DeepWiki (64.06%) with a quality score of 68.79% across seven programming languages.

Browse all 50 Code Intelligence papers →


🎨 Image Generation (5)

ANCHOR: LLM-driven Subject Conditioning for Text-to-Image Synthesis

This paper proposes the ANCHOR dataset, featuring 70K+ abstract captions from 5 news outlets to expose T2I model failures in multi-subject, contextual reasoning, and fine-grained grounding. It introduces SAFE, which utilizes LLMs to extract key subjects and reinforces subject representations at the embedding layer to enhance image-text consistency.

From AR to Diffusion: Efficiently Adapting Large Language Models with Strictly Causal and Elastic Horizons

This paper proposes FLUID, which efficiently adapts pre-trained autoregressive (AR) LLMs into diffusion-based parallel generation models using strictly causal attention and entropy-aware Elastic Horizons. With only 2.7B adaptation tokens, it achieves reasoning and code generation performance close to strong AR models and superior to existing diffusion baselines.

MENTOR: Efficient Autoregressive Image Generation with Balanced Multimodal Control

MENTOR utilizes a unified autoregressive decoder and two-stage multimodal training to align reference images and text instructions into the same generation prefix. With only 3M training data and a budget of approximately 1.5 days on 8 A100 GPUs, it achieves a superior balance between concept preservation and prompt following.

Multimodal Large Language Models for Multi-Subject In-Context Image Generation

This paper proposes MUSIC, which introduces the visual reasoning capabilities of Multimodal Large Language Models (MLLMs) into multi-subject in-context image generation. Through automated training data synthesis, visual CoT, and semantic-driven spatial layout planning, it significantly mitigates issues of subject omission, identity confusion, and semantic drift when generating multiple reference subjects simultaneously.

Think Bright, Diffuse Nice: Enhancing T2I-ICL via Inductive-Bias Hint Instruction and Query Contrastive Decoding

This paper proposes TBDN, a training-free framework that utilizes Hint Instruction to focus LVLMs on the final query and Query Contrastive Decoding to suppress prior-dominated hallucinations. By delivering more accurate textual descriptions to diffusion models, it significantly improves text-to-image in-context learning performance on CoBSAT and T2I Fast Mini-ImageNet.


🎬 Video Generation (4)

Accelerating Training of Autoregressive Video Generation Models via Local Optimization with Representation Continuity

The authors propose the Local Optimization + Representation Continuity (ReCo) training strategy. By optimizing within local windows and constraining smooth transitions of hidden states, they achieve a 2x acceleration in training autoregressive video generation models without sacrificing generation quality.

OSCBench: Benchmarking Object State Change in Text-to-Video Generation

The authors propose OSCBench—the first benchmark specifically designed to evaluate Object State Change (OSC) capabilities in text-to-video (T2V) models. Built on cooking scenarios with 1,120 prompts covering Regular, Novel, and Compositional scenarios, the benchmark reveals that even the strongest T2V models achieve an OSC accuracy of only 0.786.

Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement

VideoRepair is introduced as the first training-free, model-agnostic self-correction framework for text-to-video generation. It utilizes MLLMs to detect fine-grained text-video misalignments, preserving correct regions while selectively refining problematic ones. It consistently improves alignment quality across four different T2V backbone models on EvalCrafter and T2V-CompBench.

TeachMaster: Generative Teaching via Code

TeachMaster proposes the Generative Teaching paradigm, using code as an interpretable intermediate representation for educational videos. It employs collaborating agents for planning, code generation, narration, debugging, synchronization, and layout to produce full-course videos, achieving near-human quality while reducing the production cost of a 45-hour course to approximately 0.3% of traditional methods.


🧩 Multimodal VLM (83)

A Survey of Deep Learning for Geometry Problem Solving

To be added after in-depth reading.

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends

This paper systematically reviews Visually Rich Document Understanding (VRDU) based on Multimodal Large Language Models (MLLMs), categorizing OCR-based and OCR-free methods from two dimensions: feature representation/fusion and training paradigms, while discussing emerging directions such as data scarcity, multi-page documents, multilingual support, RAG, and agents.

AdaTooler-V: Adaptive Tool-Use for Images and Videos

This paper identifies a widespread blind tool-use problem in existing "thinking with images" MLLMs—models tend to force zoom-in or frame extraction for all visual questions, resulting in overthinking that degrades accuracy and increases inference costs. To address this, the authors propose AdaTooler-V, which introduces the AT-GRPO reinforcement learning algorithm. By using a sample-level Tool Benefit Score to dynamically adjust reward scales (encouraging tool use when effective and penalizing it when unnecessary), a 7B model achieves 89.8% on the V* high-resolution benchmark, surpassing GPT-4o and Gemini 1.5 Pro.

AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce

The AFMRL framework is proposed, framing fine-grained understanding of e-commerce products as an attribute generation task. It enhances contrastive learning via key attributes generated by MLLM (AGCL) and back-optimizes the attribute generator using retrieval performance as a reward signal (RAR), achieving SOTA retrieval performance on large-scale e-commerce datasets.

AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

This paper proposes AICA-Bench, a comprehensive benchmark covering three dimensions: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). Evaluating 23 VLMs reveals two systematic flaws: intensity calibration failure and shallow descriptions. A training-free framework, GAT Prompting, is introduced to mitigate these issues.

Aligned Multi-View Scripts for Universal Chart-to-Code Generation

Utilizing "semantically equivalent scripts for the same chart in Python, R, and LaTeX" as a new supervision signal, this work constructs the 176K quadruplet dataset Chart2NCode. It proposes CharLuMA, a lightweight adapter that integrates a "language-conditional low-rank subspace router" into the LLaVA projector, enabling a single model to achieve high execution rates and visual fidelity across three plotting languages.

All Changes May Have Invariant Principles: Improving Ever-Shifting Harmful Meme Detection via Design Concept Reproduction

This paper proposes RepMD, a method that constructs Design Concept Graphs (DCG) to guide MLLMs in detecting evolving harmful memes. Inspired by attack trees, a DCG describes the steps and logic that malicious users follow when designing harmful memes. RepMD achieves 81.1% accuracy on GOAT-Bench.

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

BloomBench reconstructs VLM evaluation using Bloom’s cognitive taxonomy by organizing 7,747 bilingual image-text QA samples into 6 cognitive levels and 106 task types. It finds that high scores in current VLMs often mask significant shortcomings in factual recall, creative synthesis, and cross-lingual reasoning.

Automatic Slide Updating with User-Defined Dynamic Templates and Natural Language Instructions

This paper defines the new task of "dynamic slide updating on user-defined templates based on natural language instructions," constructs the DynaSlide benchmark containing 20,036 instruction-execution triplets, and proposes SlideAgent as a strong reference baseline.

Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations

This work constructs AniMINT, the first evaluation set for UI animation understanding (300 densely annotated animation videos + 3 experts + 300 user annotations). Systematic testing of nine SOTA VLMs shows that while basic motion effects are recognizable, significant gaps remain in functional classification and high-level semantic interpretation compared to humans. Furthermore, enhancing Gemini-2.5-Flash with Motion-Context-Perceptual Cues (MCPC) simultaneously improves classification and interpretation performance.

Browse all 83 Multimodal VLM papers →


🧠 VLM Reasoning (32)

A Survey of Multimodal Mathematical Reasoning: From Perception, Alignment to Reasoning

This survey proposes a complementary perspective consisting of the Perception–Alignment–Reasoning (PAR) process framework and the Answer–Process–Executable (APE) evaluation framework. It systematically organizes three task families—geometry, chart/table, and visual word problems—mapping existing methods and benchmarks onto these two coordinate axes. It represents the first process-centric survey on multimodal mathematical reasoning.

Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization

The GPRO framework is proposed to address overthinking in LVLMs by dynamically routing computation to three paths (Fast/Perception Re-check/Reasoning Reflection) at each token generation step through a meta-reasoning controller, simultaneously improving both accuracy and efficiency.

AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

This paper proposes AnchorSeg, which reframes reasoning segmentation as a structured conditional generation process based on a language-grounded query bank. It explicitly decouples spatial localization and semantic reasoning via anchor queries and a Token-Mask cycle consistency training objective, achieving SOTA on ReasonSeg (67.7% gIoU, 68.1% cIoU).

ArrowGEV: Grounding Events in Video via Learning the Arrow of Time

This paper proposes ArrowGEV, a reinforcement learning framework inspired by the "Arrow of Time" in physics. It models temporal directionality by distinguishing between time-sensitive and time-insensitive events, enhancing the event grounding accuracy and temporal understanding of VLMs.

Can MLLMs Reason Beyond Language? VisReason: A Comprehensive Benchmark for Vision-Centric Reasoning

VisReason constructs a multimodal benchmark containing 1,505 daily visual reasoning problems to specifically test whether models can reason directly based on visual evidence. Results show that even the strongest model achieves an average accuracy of only 47.5%, significantly lower than the human performance of 71.4%, and that CoT and larger reasoning budgets provide limited improvements.

CAPruner: Conceptual-Adjacent Scene Graph Pruner for Enhancing 3D Spatial Reasoning of Large Language Models

To address the conflict where feeding complete 3D scene graphs to LLMs leads to token explosion while existing distance-based KNN pruning often removes task-critical relations, this paper proposes CAPruner. It integrates "query semantic relevance" and "spatial proximity" into a lightweight MLP (only 1219 parameters) to score the importance of each edge. The model is trained via weak supervision by "aggregating edge weights into node weights" using only target object labels. Under a fixed edge budget, it preserves relations truly useful for specific 3D-VL tasks, significantly improving the spatial reasoning accuracy of downstream LLMs.

ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding

ChemVLR is proposed as the first reasoning-based VLM in the chemical domain. It constructs a 760K reasoning dataset via a cross-modal reverse engineering strategy and employs a three-stage training pipeline (CPT-SFT-RL), significantly outperforming proprietary models and domain-specific VLMs in molecular recognition and reaction prediction.

Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning

SPUR is the first benchmark designed for the "Perception \(\rightarrow\) Understanding \(\rightarrow\) Reasoning" three-stage evaluation of biomedical experimental images (multi-panel staining, Western blots, and statistical charts). It contains 4,264 expert-verified MCQs, revealing that current MLLMs (with Gemini 3 Pro Preview barely exceeding 60%) generally perform 12.76%–31.41% lower in quantitative reasoning than in qualitative reasoning.

Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

The authors construct EgoPoint-Bench, the first hybrid real+physical simulation benchmark for egocentric "finger pointing" QA (11.7k QA / 5 dimensions / 3 semantic referential levels). They confirm that current SOTA MLLMs generally rely on "visual proximity / saliency" pseudo-correlations rather than truly parsing the fingertip ray. Through LoRA fine-tuning on simulated data, they achieve an average improvement of up to +25 points and robust sim-to-real generalization.

DRIFT: Transferring Reasoning Priors for Efficient MLLM Fine-Tuning

DRIFT treats the "parameter difference between a text reasoning expert and a multimodal model" as a directional prior. During multimodal SFT backpropagation, it applies a lightweight bias to gradients (without modifying weights). Using only 4K multimodal CoT data and approximately 2 hours of training, it consistently pushes Qwen2.5-VL-7B performance on benchmarks like MathVista, MathVerse, and WeMath beyond parameter merging baselines and heavy SFT/RL methods.

Browse all 32 VLM Reasoning papers →


⚡ VLM Efficiency (6)

APB-V: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention

APB-V accelerates long-video LMM inference using sequence-parallelism-aware approximate attention and system-level load balancing. While preserving full visual embeddings, it achieves speedups of 12.72×, 1.70×, and 1.18× compared to FlashAttn, ZigZagRing, and APB, respectively, under a 64-frame 1440p setting without significant performance loss.

From Inheritance to Saturation: Disentangling the Evolution of Visual Redundancy for Architecture-Aware MLLM Inference Acceleration

This work reveals two sources of visual redundancy in MLLM inference: Inherited Visual Redundancy (IVR) caused by dense ViT tokenization and Secondary Saturation Redundancy (SSR) caused by deep semantic saturation, which manifests differently across backbone architectures. The proposed HalfV framework handles these two types of redundancy separately, achieving a 4.1x FLOPs acceleration on Qwen2.5-VL while preserving 96.8% of the performance.

HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

This paper proposes HERMES, which conceptualizes KV cache as a hierarchical memory framework (shallow = sensory memory, middle = working memory, deep = long-term memory) based on a mechanistic analysis of MLLM decoder hierarchical attention preferences. It achieves training-free efficient streaming video understanding, maintaining or improving accuracy while reducing video tokens by 68%. The TTFT latency is <30ms, 10x faster than the previous SOTA.

HiPrune: Hierarchical Attention for Efficient Token Pruning in Vision-Language Models

This paper identifies a hierarchical attention pattern in vision encoders—middle layers focus on primary objects while deep layers capture global information. Based on this, it proposes HiPrune, a training-free and model-agnostic vision token pruning method. By selecting three types of tokens (Anchor/Buffer/Register) to preserve multi-level visual information, it maintains 99.3% performance using only 1/3 of the tokens, reducing FLOPs by 58.7%.

MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

To address the "straggler" problem where Multimodal MoE models are bottlenecked by the "slowest expert" during Expert Parallelism (EP) inference, MACS re-estimates expert load using the Shannon entropy of visual tokens as semantic importance weights. It dynamically scales expert capacity based on the real-time modality composition of the batch. MACS is a training-free inference framework that maintains nearly identical performance (averaging 99.7% of vanilla MoE) across 12 multimodal benchmarks, significantly outperforming token-counting methods like CAI-MoE.

ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs

ReGATE utilizes a frozen text-only teacher to estimate which output tokens require visual information, combined with the student's historical learning difficulty to dynamically select training tokens. This allows MLLMs to train faster with fewer tokens without changing architecture or adding parameters, achieving or exceeding standard fine-tuning performance on multiple image and video benchmarks.


🎵 Audio & Speech (72)

Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations

This paper proposes the Affectron framework, which implements two train-time augmentation strategies—Emotion-Driven Top-K NV Matching and Emotion-Aware Top-K Routing—on small-scale open-source decoupled corpora. It achieves diverse and emotionally aligned synthesis of nonverbal vocalizations (NVs, e.g., laughter, sighs), significantly surpassing the VoiceCraft baseline based on pure linguistic pre-training.

An Exploration of Mamba for Speech Self-Supervised Models

This work presents the first comprehensive exploration of Mamba as a foundation model for speech self-supervised learning (SSL). It finds that Mamba-based HuBERT outperforms Transformers in long-context ASR, streaming ASR, and causal probing tasks while maintaining linear time complexity.

Analyzing Reasoning Shifts in Audio Deepfake Detection under Adversarial Attacks: The Reasoning Tax versus Shield Bifurcation

This paper designs a "three-dimensional forensic auditing" framework (Acoustic Perception / Cognitive Coherence / Cognitive Dissonance) for Audio Language Models (ALMs) performing deepfake detection with reasoning chains. It finds that CoT reasoning is not a universal enhancement—it acts as a "Reasoning Shield" for models with strong acoustic perception (Qwen2-Audio), but becomes a "Reasoning Tax" for those with weak perception (Gemma-3n, Phi-4). Furthermore, when a model is compromised, high cognitive dissonance can serve as a "silent alarm" to alert human auditors.

Anchored Cyclic Generation: A Novel Paradigm for Long-Sequence Symbolic Music Generation

This paper proposes the Anchored Cyclic Generation (ACG) paradigm, which alleviates error accumulation in long-sequence symbolic music generation by using confirmed musical content as anchors to calibrate the generation direction during the autoregressive process. A hierarchical framework, Hi-ACG, is constructed to achieve music generation from global structure to local details.

[b] = [d] − [t] + [p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic

This work systematically demonstrates the existence of linear phonological feature vectors within the representation spaces of self-supervised speech models (S3M). These vectors satisfy word2vec-style vector arithmetic relationships, and their scaling correlates continuously with acoustic measurements.
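
The arithmetic in the title can be illustrated with a toy example. The sketch below uses hand-made two-dimensional feature vectors (voicing, place of articulation); these values and the `analogy` helper are assumptions made for illustration, not representations or code from any speech model.

```python
import math

# Hypothetical 2-D phone vectors: (voicing, place). Real S3M representations
# are high-dimensional; these toy values only illustrate the arithmetic.
PHONES = {
    "p": (0.0, 0.0),  # voiceless, bilabial
    "b": (1.0, 0.0),  # voiced, bilabial
    "t": (0.0, 1.0),  # voiceless, alveolar
    "d": (1.0, 1.0),  # voiced, alveolar
}

def analogy(a, b, c):
    """Return the phone whose vector is nearest to vec(a) - vec(b) + vec(c)."""
    target = tuple(x - y + z for x, y, z in zip(PHONES[a], PHONES[b], PHONES[c]))
    return min(PHONES, key=lambda ph: math.dist(PHONES[ph], target))

result = analogy("d", "t", "p")  # [d] - [t] + [p]
```

In this toy space, subtracting [t] from [d] isolates the voicing direction, and adding it to [p] lands on [b].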

Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

This paper reveals that the perception weaknesses of current AudioLLMs stem from ASR-centric training patterns (systemic suppression of paralinguistic and non-linguistic information). It proposes the Unified Audio Schema (UAS), which structures audio information into a JSON format across three dimensions: transcription, paralinguistics, and non-linguistic events. UAS achieves a 10.9% improvement in perception accuracy on the MMSU benchmark while maintaining reasoning capabilities.

Beyond Transcripts: A Renewed Perspective on Audio Chaptering

This paper systematically reconstructs the long-form audio chaptering task: advancing evaluation from transcript-dependent text space to transcript-invariant temporal space, and demonstrating that AudioSeg, utilizing direct audio representations, significantly outperforms text-based segmentation and existing MLLM solutions on YTSeg.

Closing the Modality Reasoning Gap for Speech Large Language Models

This paper introduces TARS (Trajectory Alignment for Reasoning in Speech), a reinforcement learning-based framework that aligns speech-conditioned reasoning trajectories with text-conditioned trajectories through two dense signals: representation alignment and behavior alignment. It achieves SOTA performance in 7B-scale models, with the Modality Recovery Rate (MRR) approaching or even exceeding 100%.

Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

This paper proposes SwanBench-Speech, which systematically evaluates long-form speech generation using 1,101 samples across 17 real-world downstream scenarios and 7 automatic evaluation dimensions. The study concludes that while current models approach usability in content accuracy, they still significantly lag behind real recordings in reverb consistency, long-range prosody, and expressive hierarchy.

Computational Narrative Understanding for Expressive Text-to-Speech

This paper extracts character direct quotes from audiobook fiction to construct LibriQuote, a large-scale expressive speech dataset (5.3K hours of quotes + 12.7K hours of narration). It annotates speaking styles using speech verbs and adverbs as pseudo-labels. Experiments demonstrate that fine-tuning flow-matching models improves both expressiveness and intelligibility, and LibriQuote-test serves as a challenging benchmark for expressive TTS.

Browse all 72 Audio & Speech papers →


🔎 AIGC Detection (17)

AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

AEGIS is the first comprehensive benchmark for academic image forgery forensics, covering 7 major academic image categories with 39 subcategories, 4 forgery strategies (entirely fabricated, reference-based rewriting, local inpainting, and local editing), and 25 generative models. It proposes four tasks: forgery scope discrimination, text artifact recognition, manipulation type classification, and tampered pixel localization. Evaluating 25 MLLMs and 9 expert models reveals a structural complementarity: even GPT-5.1 achieves an overall score of only 48.80%, and expert models reach a pixel IoU of only 30.09%, highlighting that "generation evolves faster than forensics" and the trade-off between "MLLM reasoning vs. expert model sensitivity."

Authorship Attribution in Multilingual Machine-Generated Texts

Existing research on machine-generated text authorship attribution (identifying which specific LLM or human produced a text) is almost entirely monolingual (primarily English). This paper is the first to formally define Multilingual Authorship Attribution (ML-MGT) and Cross-Lingual Authorship Attribution (CL-MGT). Through a systematic evaluation of 18 languages \(\times\) 8 generators (7 LLMs + human) using statistical methods, fine-tuned encoders, contrastive learning, and fine-tuned decoders, it finds that while fine-tuned/contrastive methods adapt well to multiple languages (best macro-F1 > 0.9), they degrade severely when transferring across different language families or writing systems, revealing the challenges of real-world multilingual scenarios.

Beyond the Final Actor: Modeling the Dual Roles of Creator and Editor for Fine-Grained LLM-Generated Text Detection

This paper proposes RACE (Rhetorical Analysis for Creator-Editor Modeling), which utilizes Rhetorical Structure Theory (RST) to construct logic graphs for modeling the thought architecture of the "Creator," while extracting discourse unit-level features to capture the linguistic style of the "Editor." This enables four-way fine-grained LLM-generated text detection (Human-written / LLM-generated / LLM-polished Human / Human-rewritten LLM).

BIASEDTALES-ML: A Multilingual Dataset for Analyzing Narrative Attribute Distributions in LLM-Generated Stories

BiasedTales-ML constructs a corpus of approximately 350,000 LLM-generated children's stories across 8 languages. Through a factorial prompt design and a distributional analysis framework, it reveals that the distribution of social attributes in narratives varies significantly across different languages, and English-centric evaluations fail to reflect bias patterns in multilingual scenarios.

C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts

C-ReD constructs a Chinese AI-generated text detection benchmark covering five writing scenarios, nine LLM generators, and real-world prompts. It demonstrates that detection difficulty depends heavily on the domain, generator, and prompt, while fine-tuning on C-ReD significantly enhances generalization to unseen models and external Chinese data.

Can AI-Generated Persuasion Be Detected? Persuaficial Benchmark and AI vs. Human Linguistic Differences

This paper introduces Persuaficial—a high-quality multilingual benchmark for AI-generated persuasive text covering six languages. It systematically evaluates the differences in automatic detection difficulty between LLM-generated and human-written persuasive texts, finding that subtle AI persuasion is significantly harder to detect than human persuasion (\(F_1\) drops by approximately 20%), whereas overly intensified persuasion is actually easier to identify.

DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection

DetectRL-X constructs a benchmark containing 3.456 million samples across multiple languages, domains, attacks, and lengths with parallel binary/ternary classification, showing that existing detectors still have significant robustness gaps in real-world multilingual and human-AI collaborative writing scenarios.

ExaGPT: Example-Based Machine-Generated Text Detection for Human Interpretability

ExaGPT reframes the task of "determining whether a text is human-written or LLM-generated" as "identifying which side has more similar spans in a data store." By utilizing BERT embeddings, k-NN retrieval, and dynamic programming for optimal span segmentation, it provides interpretable evidence (most similar retrieved span examples) while improving accuracy by up to \(+37.0\) points over previous explainable detectors at 1% FPR.

Frame In, Frame Out: Measuring Framing Bias in LLM-Generated News Summaries

This paper proposes FIFO, a method that uses an LLM jury with expert calibration to measure whether LLM news summaries introduce framing bias on XSum at scale. It finds that several high-capacity models exhibit higher proportions of framed expressions compared to human summary baselines.

From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment

This paper proposes a sentence-level explanation evaluation framework for automated rubric scoring. Comparing fine-tuned PLMs, prompted LLMs, SHAP attribution, and LLM rationales on a classroom feedback quality scoring task, the study finds that fine-tuned PLMs are more accurate, while SHAP provides more faithful and transferable explanations than LLM-generated ones.

Browse all 17 AIGC Detection papers →


🤖 Robotics & Embodied AI (11)

Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents

SkillNav decomposes the vision-language navigation task into 5 atomic skills (Direction Adjustment, Vertical Movement, Stop, Landmark Identification, Area Identification) + 1 Temporal Order Planning skill. For each skill, a DUET sub-agent is fine-tuned on synthetic data, while a training-free VLM router performs temporal reordering + sub-goal localization + skill selection. It achieves SOTA generalization on GSA-R2R (Test-N-Scene SPL 48% vs. the previous highest of 43%).

Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection

This paper proposes REFORM, shifting multimodal forgery detection from "direct label fitting" to "learning a verifiable forensic reasoning process." Through the ROM reasoning-annotated dataset, dual decoders, and GRPO training, REFORM achieves superior cross-domain generalization and interpretable detection results on ROM, DGM4, and MMFakeBench.

ElasticFlow: One-Step Physics-Consistent Policy with Elastic Time Horizons for Language-Guided Manipulation

The paper proposes ElasticFlow, which replaces instantaneous velocity fields with MeanFlow (mean velocity fields) for learning language-conditioned robotic actions. By explicitly encoding control granularity using an "Elastic Time Horizon \(\Delta t=t-r\)", it achieves 1-NFE single-step inference (~71Hz) and outperforms OpenVLA and \(\pi_0\) on long-horizon tasks such as LIBERO-Long and CALVIN ABC-D.
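
For orientation, the mean-velocity idea can be written out as follows. This is the generic MeanFlow formulation, stated as an assumption about what ElasticFlow builds on; the symbols \(z_t\), \(v\), and \(u\) are introduced here and the paper's action-conditioned version may differ.

\[
u(z_t, r, t) = \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau,
\qquad
z_r = z_t - (t - r)\, u(z_t, r, t)
\]

Here \(v\) is the instantaneous velocity field and \(u\) its average over the horizon \(\Delta t = t - r\). Because \(u\) already integrates over the interval, a single network evaluation moves the sample from time \(t\) to time \(r\), which is what makes 1-NFE inference possible.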

GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning

GoViG proposes a new task of generating navigation instructions based only on initial and goal egocentric observations. It decomposes the task into two steps: "imagining intermediate frames then writing instructions." By jointly training Anole-7B with a dual objective of token-level MSE and label-smoothing CE, and employing one-pass or interleaved multimodal reasoning strategies, the method improves the BLEU-4 score from a baseline of 0.08 to 0.32, maintaining 0.27 on cross-domain real-world videos.

GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap

GROKE proposes evaluating navigation instructions without any vision by serializing OpenStreetMap (OSM) data into JSON and utilizing Gemini-3 Pro as a follower agent to execute instructions on the graph. Navigation metrics (Navigation Error / SR / SDTW) serve as proxies for instruction quality. Compared to heuristic baselines on Map2Seq, it reduces Navigation Error (NE) by 68.5%, and results show that NE is significantly correlated with human judgment of "instruction clarity" (\(r = -0.31, p < 0.01\)).

Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System

Libra-VLA decomposes robot actions into a hybrid action space of "discrete macro-intent + continuous micro-pose." It utilizes System 2 (VLM + parallel coarse-action head) for low-frequency planning and System 1 (diffusion transformer + independent SigLIP encoder) for high-frequency refinement. With true asynchronous execution achieved via an intent buffer, it reaches a SOTA of 97.2% on LIBERO and 79.5% zero-shot on LIBERO-Plus (10% higher than the previous OpenVLA-OFT+).

Limited Linguistic Diversity in Embodied AI Datasets

This paper performs a systematic "linguistic diversity audit" on mainstream VLA training corpora (RT-1, BRIDGE, TacoPlay, Language Table, LIBERO). By quantifying lexical, semantic, and syntactic dimensions, it reveals that VLA data contains < 2% unique instructions, RT-1 has only 49 unique words in the entire corpus, and negation/conditional sentences account for < 1%. This "template-based poverty" compared to instruction-tuning corpora (OASST2 93%, Alpaca 99.8% unique) may be the root cause of VLA models' vulnerability to paraphrasing and generalization failures.

Mango: Multi-Agent Web Navigation via Global-View Optimization

Mango constructs a global approximate structure of a website before navigation and employs Thompson Sampling to dynamically allocate a limited navigation budget among candidate URLs. This prevents LLM web agents from blindly exploring from the homepage and significantly outperforms baselines such as AgentOccam and WebWalker on WebVoyager and WebWalkerQA.
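
A minimal sketch of budget allocation with Thompson Sampling, assuming a Beta-Bernoulli model in which each candidate URL is an arm and a visit either helps or does not. The reward model, the `true_rates` values, and the function name `thompson_allocate` are hypothetical; the paper's actual reward signal is not described here.

```python
import random

def thompson_allocate(true_rates, budget, seed=0):
    """Allocate `budget` visits across arms with Beta-Bernoulli Thompson Sampling.

    true_rates: hypothetical per-URL probability that a visit is useful
    (unknown to the agent; used here only to simulate feedback).
    """
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(budget):
        # Sample a plausible success rate for each arm from its Beta posterior.
        samples = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
        arm = max(range(len(samples)), key=samples.__getitem__)
        pulls[arm] += 1
        # Simulated feedback updates the chosen arm's posterior.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls

pulls = thompson_allocate([0.9, 0.1, 0.05], budget=300)
```

Arms that look promising are sampled more often, while uncertain arms still receive occasional visits, so the budget concentrates on useful URLs without committing too early.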

VLN-NF: Feasibility-Aware Vision-and-Language Navigation with False-Premise Instructions

This paper proposes the VLN-NF benchmark—the first task requiring VLN agents to identify false-premise instructions and output NOT-FOUND in 3D partially observable environments. It further introduces the REV-SPL evaluation metric and the ROAM two-stage hybrid framework, where ROAM achieves 6.1 REV-SPL, representing a 45% improvement over supervised baselines.

When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models

By translating the LIBERO robotic manipulation benchmark into ten languages, this paper systematically reveals for the first time that VLA models suffer a 30–50% drop in success rates under non-English instructions. It identifies that "linguistic influence is highly non-uniform across execution steps"—where only a few critical steps are sensitive to language but dominate failure cases. Based on this, a method for inference-time representation alignment specifically on these steps is proposed, significantly recovering multilingual performance.

Browse all 11 Robotics & Embodied AI papers →


🎮 Reinforcement Learning (46)

A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks (EAGLET)

EAGLET decouples long-horizon agent tasks into "global planner + local executor" modules. It trains a plug-and-play planner through a two-step pipeline: "cold-start SFT with homologous consensus filtering" followed by "GRPO fine-tuning using executor capability gain as reward." It achieves new SOTA on three long-horizon benchmarks while reducing training costs to 1/8 of RL baselines.

A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions

This is the first systematic survey of Reinforcement Learning (RL) for LLMs under data scarcity. It proposes a three-layer taxonomy (data-centric, training-centric, and framework-centric) and covers directions such as data pruning/synthesis/compression, trajectory generation/reward engineering/policy optimization, and self-evolution/co-evolution/multi-agent evolution.

Adaptive Instruction Composition for Automated LLM Red-Teaming

The Adaptive Instruction Composition (AIC) framework is proposed, utilizing Neural Thompson Sampling to adaptively select attack instructions within a combinatorial space of crowdsourced harmful queries and jailbreak tactics. By simultaneously optimizing attack success rate and diversity, it significantly outperforms existing methods on HarmBench.

ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring

ARGUS utilizes a Prosecutor–Defender–Umpire three-agent debate combined with GRPO reinforcement learning. This enables the ad-review VLM to correct historical "outdated labels" and uncover latent violations in gray zones when policies are updated. Industrial A/B testing shows a relative 35.2% reduction in the Violation Leakage Rate (VLR).

AttnPO: Attention-Guided Process Supervision for Efficient Reasoning

This paper proposes AttnPO, a low-overhead process-supervised RL framework that leverages the model's intrinsic attention signals for step-level credit assignment. By identifying Key-Focus Heads (KFH) to distinguish between redundant and critical reasoning steps, AttnPO significantly reduces reasoning length while substantially improving accuracy.

Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models

This paper identifies that in diffusion language models (dLLMs), "tokens that attend more to determined contexts exhibit more stable generation and are more critical for reasoning." Consequently, it proposes AGDO—a method that derives denoising order from attention and emphasizes these "attention hub" tokens via weighting during supervised fine-tuning (SFT) and reinforcement learning (RL). This approach consistently outperforms existing post-training methods for dLLMs that rely on random masking in mathematical and code reasoning tasks.

Beyond Majority Voting: Towards Fine-grained and More Reliable Reward Signal for Test-Time Reinforcement Learning

Addressing the "confirmation bias + sparse reward" issues in TTRL caused by using majority voting for pseudo-labels, SCOPE proposes step-wise confidence-weighted voting (moving beyond frequency-based selection) and Pareto-optimal dynamic subgroup partitioning (bootstrapping local consensus in independent subgroups). On Qwen3-8B, it improves AIME 2024 from 47.13 → 52.70 and AIME 2025 from 27.40 → 31.00.
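
The contrast between frequency-based and confidence-weighted pseudo-labels can be sketched as below. This is a simplified illustration under assumptions: each sampled answer carries a single hypothetical confidence score, and the paper's step-wise confidence computation and Pareto-optimal subgroup partitioning are not reproduced.

```python
from collections import Counter, defaultdict

def majority_vote(samples):
    """samples: list of (answer, confidence); confidence is ignored."""
    return Counter(ans for ans, _ in samples).most_common(1)[0][0]

def confidence_weighted_vote(samples):
    """Pick the answer with the largest total confidence."""
    totals = defaultdict(float)
    for ans, conf in samples:
        totals[ans] += conf
    return max(totals, key=totals.get)

# Hypothetical rollouts: three low-confidence votes for "12",
# two high-confidence votes for "17".
samples = [("12", 0.30), ("12", 0.35), ("12", 0.30), ("17", 0.95), ("17", 0.90)]
by_count = majority_vote(samples)
by_confidence = confidence_weighted_vote(samples)
```

With these toy values the two rules disagree: counting selects "12", while weighting by confidence selects "17", which is the kind of case where a frequent but weakly supported answer would otherwise become the pseudo-label.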

Breaking the Impasse: Dual-Scale Evolutionary Policy Training for Social Language Agents

To address the "evolution impasse" in open-ended social language games (Negotiation / Don't Say It / Two Dollar Game) within self-play RLVR—where agent behavior homogenization leads to deterministic match outcome distributions and vanishing gradient signals—this paper proposes DEPT. It utilizes a fast/slow dual-timescale EMA baseline to detect stagnation and applies asymmetric advantage reshaping to suppress dominant outcomes while amplifying rare trajectories. This method boosts the negotiation win rate on Qwen3-4B/8B-Base from 16-20% to 32%, with simultaneous benefits observed on OOD math and reasoning benchmarks.

Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning

This paper proposes DYPO (Dynamic Policy Optimization), which routes samples to different optimization paths based on dynamic difficulty grading: Hard samples use multi-teacher distillation to reduce SFT bias, while Mid samples use Group Alignment Loss to reduce RL variance. This achieves an average gain of 4.8% on mathematical reasoning benchmarks and 13.3% on OOD tasks.

CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

The CE-GPPO algorithm is proposed. By reintroducing gradient signals for low-probability tokens outside the PPO clipping interval through stop-gradient operations, it achieves fine-grained coordinated control of policy entropy and attains a better balance between exploration and exploitation.

Browse all 46 Reinforcement Learning papers →


🎁 Recommender Systems (22)

Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

This paper introduces the Amazon Reviews 2023 large-scale dataset (570M reviews / 48M items) and constructs the BLaIR benchmark. Covering Sequential Recommendation, Collaborative Filtering, and Item Search (short and complex queries), the study benchmarks 11 top-tier LLMs as semantic encoders. It reveals that model rankings on BLaIR are almost uncorrelated with MTEB (Spearman -0.476), highlighting the unique requirements of recommendation scenarios for semantic encoders.
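
To make the rank-correlation claim concrete, the sketch below computes Spearman's coefficient for two rankings of the same models. The rankings are invented for illustration and are not the paper's data; the closed-form formula used here assumes no ties.

```python
def spearman(xs, ys):
    """Spearman rank correlation for equal-length lists without ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Hypothetical positions of five models on two leaderboards.
general_rank = [1, 2, 3, 4, 5]
recsys_rank = [3, 5, 1, 4, 2]
rho = spearman(general_rank, recsys_rank)  # about -0.3
```

A value near +1 means the two benchmarks order models the same way; a value near 0 or below means a strong general-purpose ranking says little about ranking on the recommendation tasks.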

ClusterRAG: Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation

ClusterRAG introduces collaborative filtering into personalized RAG by constructing user representations from historical documents and clustering them with HDBSCAN. It hierarchically retrieves profile documents from both the target user and similar users to compose prompts, enabling the hybrid mode to outperform vanilla RAG, LaMP-IPA, ROPG, and CFRAG across the LaMP multi-task benchmark.

Culinary Crossroads: A RAG Framework for Enhancing Diversity in Cross-Cultural Recipe Adaptation

The authors observe that standard RAG "produces non-diverse outputs even when given diverse contexts" in creative tasks. They design CARRIAGE, a plug-and-play framework featuring query rewriting, diversity-aware MMR re-ranking, sliding-window dynamic context, and contrastive context injection. This framework effectively transfers "contextual diversity" to "output diversity," improving lexical/semantic/ingredient diversity and CultureScore in Spanish cross-national recipe adaptation, achieving Pareto efficiency compared to closed-book LLMs.
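
The re-ranking step can be illustrated with classic greedy Maximal Marginal Relevance. This is a sketch of the textbook algorithm, not CARRIAGE's diversity-aware variant; the similarity values, the `lam` trade-off parameter, and the function name `mmr_rerank` are assumptions.

```python
def mmr_rerank(query_sim, doc_sim, k, lam=0.5):
    """Greedy Maximal Marginal Relevance.

    query_sim[i]: similarity of candidate i to the query.
    doc_sim[i][j]: similarity between candidates i and j.
    lam: trade-off; 1.0 is pure relevance, 0.0 is pure diversity.
    Returns the indices of the selected candidates, in selection order.
    """
    selected = []
    remaining = list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def score(i):
            # Penalize similarity to the closest already-selected candidate.
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different.
query_sim = [0.9, 0.85, 0.6]
doc_sim = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
picked = mmr_rerank(query_sim, doc_sim, k=2, lam=0.5)
```

With `lam=0.5` the second pick skips the near-duplicate in favor of the dissimilar candidate; with `lam=1.0` the same call reduces to plain relevance ranking.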

Decisive: Guiding User Decisions with Optimal Preference Elicitation from Unstructured Documents

The DECISIVE interactive decision-making framework is proposed to extract objective option scoring matrices from unstructured documents. By combining Bayesian preference inference with adaptive pairwise comparison questions, the system efficiently learns the user's latent preference vector. This achieves transparent personalized recommendations while minimizing interaction burden, improving decision accuracy by up to 20% over strong baselines.

From Past To Path: Masked History Learning for Next-Item Prediction in Generative Recommendation

This paper proposes the Masked History Learning (MHL) training framework, which incorporates a masked history reconstruction auxiliary task into the autoregressive training of generative recommendation. Combined with an entropy-guided adaptive masking strategy and a curriculum learning scheduler, it shifts the model from merely predicting "what is next" to understanding "why this path was formed," significantly outperforming SOTA on three datasets.

From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents

This paper proposes the Memora benchmark and the FAMA metric, extending long-term memory evaluation from shallow factual retrieval to memory consolidation and mutation handling across weeks to months, revealing systemic failures of existing LLMs and memory agents in handling frequent knowledge updates.

GraphLoRA: Structure-Aware Low-Rank Adaptation for Large Language Model Recommendation

Existing LLM recommenders either feed collaborative information into prompts or inject pre-trained static embeddings into LoRA weights, treating structure as a "one-read" static input. GraphLoRA embeds a trainable graph message passing network into the LoRA bottleneck (between down-projection \(\mathbf{A}\) and up-projection \(\mathbf{B}\)), allowing collaborative topology to propagate dynamically within the parameter space and directly guide weight updates. With only ~1.67% additional parameters, it outperforms SOTAs like CoRA on ML-1M and Amazon-Book.

HARPO: Hierarchical Agentic Reasoning for User-Aligned Conversational Recommendation

This paper proposes the HARPO framework, which redefines conversational recommendation as a structured decision-making problem optimized for recommendation quality. Through four components (hierarchical preference learning, value-network-guided tree search reasoning, virtual tool operations, and multi-agent refinement), it significantly outperforms existing methods on the ReDial, INSPIRED, and MUSE benchmarks.

HORIZON: A Benchmark for in-the-wild User Behaviour Modeling

This paper proposes HORIZON, the first fully open-source large-scale cross-domain long-term recommendation benchmark. Based on merged Amazon Reviews, it constructs a unified interaction history containing 54M users and 35M items. It designs a four-quadrant evaluation protocol decoupled along the time axis and user dimension, revealing that models like BERT4Rec perform strongly in-distribution but significantly degrade in temporal extrapolation and unseen user scenarios. Furthermore, LLMs do not consistently outperform specialized architectures in user behavior modeling.

HSUGA: LLM-Enhanced Recommendation with Hierarchical Semantic Understanding and Group-Aware Alignment

HSUGA decouples and enhances the two core stages of LLM-enhanced sequential recommendation. It adopts the HSU module, which uses "staged processing + four atomic edits (Add/Delete/Update/Retain)," to stabilize semantic extraction from long sequences. It also introduces GAA self-distillation alignment, which groups users by activity (top 20% active / 80% long-tail) to address under-supervision for long-tail users and over-alignment for active users. As a plug-and-play solution, it yields performance gains across Steam/Fashion/Beauty datasets using GRU4Rec/BERT4Rec/SASRec backbones.

Browse all 22 Recommender Systems papers →


🔗 Causal Inference (7)

Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size

This paper establishes the first scaling laws for the "contextual entrainment effect," discovering that larger models are more resistant to false information in semantic contexts (negative exponent) but more prone to copying irrelevant tokens in non-semantic contexts (positive exponent), revealing opposing scaling behaviors between semantic filtering and mechanical copying functions.
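
A scaling-law exponent of the kind described here is commonly estimated by a least-squares fit in log-log space. The sketch below assumes a simple power law `value ~= a * size**b`. The data points are synthetic and invented purely to show how a negative and a positive exponent would appear, and the paper's own fitting procedure may differ.

```python
import math

def fit_power_law(sizes, values):
    """Least-squares fit of values ~= a * sizes**b in log-log space; returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope

# Synthetic measurements (hypothetical, not taken from the paper).
sizes = [1e8, 1e9, 1e10]
semantic_effect = [0.5 * s ** -0.2 for s in sizes]      # shrinks with model size
non_semantic_effect = [0.01 * s ** 0.1 for s in sizes]  # grows with model size

_, b_semantic = fit_power_law(sizes, semantic_effect)
_, b_non_semantic = fit_power_law(sizes, non_semantic_effect)
```

The sign of the fitted exponent `b` is what separates the two regimes: negative when the effect weakens with scale, positive when it strengthens.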

ClimateCause: Complex and Implicit Causal Structures in Climate Reports

ClimateCause constructs the first expert-annotated dataset for complex and implicit causal structures in climate reports (874 causal relations), supporting nested causality, multi-event decomposition, correlation direction, and spatio-temporal context labeling. It proposes a readability metric based on causal graph semantic complexity, with LLM benchmarking revealing that causal chain reasoning remains a significant challenge.

Evaluating Counterfactual Strategic Reasoning in Large Language Models

This paper evaluates the strategic adaptation capabilities of LLMs using label perturbations, payoff perturbations, and joint counterfactual versions of the Repeated Prisoner's Dilemma and Rock-Paper-Scissors. It finds that while many models appear proficient in familiar games, they continue to apply templated strategies even after payoff structures are altered.

Function Words as Statistical Cues for Language Learning

The authors use Universal Dependencies corpora across 186 languages to demonstrate that three distributional properties—"high frequency + syntactic predictability + phrase boundary alignment"—are cross-linguistically universal. Simultaneously, they construct seven counterfactual variants of English to train GPT-2 small, proving that transformer learners perform best only when all three properties are satisfied. They identify a Goldilocks effect: function words must be both sufficiently frequent and sufficiently diverse to be both reliable and discriminative.

iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations

The iTAG framework is proposed, which utilizes a three-stage inverse design pipeline (parameterized causal graph construction → CoT-based concept assignment → structure-preserving text generation) to generate data with both extremely high causal graph annotation accuracy and text naturalness. This serves as a practical substitute for real annotated data in benchmarking text causal discovery algorithms.

Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective

This paper proposes CmIR (Causal Modality Invariant Representation learning), which explicitly disentangles each modality into causal invariant representations and environment-specific spurious representations based on causal inference theory. Through an elegant objective function combining invariance constraints, mutual information constraints, and reconstruction constraints, it ensures that invariant representations maintain stable predictive relationships across environments. It achieves SOTA performance in multimodal sentiment, humor, and sarcasm detection, particularly excelling in OOD and noisy scenarios.

Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation

This paper systematically investigates the multilingual counterfactual generation capabilities of LLMs across six languages. By comparing direct generation and translation-based paths, it finds that the translation path yields higher label flip rates but requires more edits. It identifies four common error patterns and validates that multilingual counterfactual data augmentation outperforms cross-lingual augmentation, particularly for low-resource languages.


🔬 Interpretability (63)

A Structured Clustering Approach for Inducing Media Narratives

The paper proposes a framework to automatically induce media narrative patterns from large-scale news corpora. By jointly modeling causal event chains and role information (Hero/Threat/Victim), it utilizes a role-constrained clustering algorithm to organize narrative chains into semantically coherent patterns. It generates interpretable narrative patterns consistent with framing theory in the domains of immigration and gun control.

A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification

This paper systematically compares extractive self-explanations generated by four open-source instruction-tuned LLMs with human rationales and post-hoc attribution methods across three types of text classification tasks. The study finds that the consistency between self-explanations and human annotations is strongly influenced by text length and task complexity; however, in perturbation-based faithfulness evaluations, self-explanations often identify a subset of tokens more critical to the model's prediction.

AdaptiveK: Complexity-Driven Sparse Autoencoders for Interpretable Language Model Representations

AdaptiveK proposes a Sparse Autoencoder driven by input semantic complexity, allowing simple text to activate fewer features and complex text to activate more. Across experiments on eight autoregressive LLMs and additional architectures, it improves reconstruction quality, conceptual decoupling, and training efficiency while reducing the need for repetitive hyperparameter tuning common in fixed TopK approaches.
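
The core idea, letting the number of active features depend on input complexity instead of a fixed TopK, can be sketched as follows. This is only an illustration: the summary does not say how AdaptiveK estimates complexity or maps it to a sparsity level, so the `complexity` score in [0, 1], the linear mapping, and the `k_min`/`k_max` bounds are assumptions.

```python
def adaptive_topk(activations, complexity, k_min=1, k_max=4):
    """Keep the k largest activations, where k grows with a complexity score in [0, 1].

    The linear mapping from complexity to k is an assumed placeholder.
    """
    c = min(max(complexity, 0.0), 1.0)
    k = k_min + round(c * (k_max - k_min))
    order = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]

feature_acts = [0.2, 0.9, 0.1, 0.5, 0.7]
simple_text = adaptive_topk(feature_acts, complexity=0.0)   # one active feature
complex_text = adaptive_topk(feature_acts, complexity=1.0)  # four active features
```

A fixed-TopK autoencoder corresponds to the special case `k_min == k_max`.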

Aligning What LLMs Do and Say: Towards Self-Consistent Explanations

This paper constructs the Post-hoc Self-Consistency Bank (PSCB, 85K decisions × 428K explanations) to quantify the feature attribution gap between LLM answers and their natural language explanations. It then improves attribution consistency through DPO without compromising model accuracy.

Compositional Steering of Large Language Models with Steering Tokens

This paper proposes compositional steering tokens, which compress behavior instructions into embedding vectors in the input space via self-distillation. By training a dedicated compositional token <and> to capture the universal concept of "composition," the method demonstrates strong generalization capabilities across unseen behavior combinations, unseen behaviors, and an unseen number of combined behaviors.

Constructing Interpretable Features from Compositional Neuron Groups

The authors utilize Semi-Nonnegative Matrix Factorization (SNMF) to directly decompose MLP activations into "sparse neuron groups × non-negative coefficients," yielding interpretable features that map back to activation contexts and combine across layers. In concept-steering evaluations on Llama-3.1-8B / Gemma-2-2B / GPT-2, these features outperform the latest SAEs (Llamascope / Gemmascope) and the strong supervised baseline DiffMeans.

Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

By training a shared feature dictionary across multiple pretraining checkpoints of the same LLM using a sparse crosscoder, this work proposes the Relative Indirect Effect (RelIE) to measure how the causal importance of individual features "emerges, persists, or vanishes" over token counts. This study provides the first observation of the concept-level evolutionary trajectory in Pythia, OLMo, and BLOOM—from "specific subword detectors" to "internalized abstract syntactic/cross-lingual detectors."

Curing "Miracle Steps" in LLM Mathematical Reasoning with Rubric Rewards

This paper identifies the widespread presence of "Miracle Steps"—phenomena where reasoning chains leap to the correct answer without derivation—in current LLM mathematical reasoning. It proposes the Rubric Reward Model (RRM), a process-based reward function using problem-specific scoring rubrics. During RL training, RRM significantly reduces Miracle Steps by 71% and improves the Verified Pass@1024 on AIME2024 from 26.7% to 62.6%.

Diffusion-CAM: Faithful Visual Explanations for dMLLMs

Diffusion-CAM is proposed as the first interpretability method specifically designed for diffusion-based Multimodal Large Language Models (dMLLMs). By extracting structurally valid intermediate representations from denoising trajectories and employing four post-processing modules (Adaptive Kernel Denoising, Distribution-aware Confidence Gating, Contextual Background Attenuation, and Single-instance Causal Debiasing), it significantly outperforms autoregressive CAM baselines on COCO Caption and GranDf.

Do LLMs Capture Embodied Cognition and Cultural Variation? Cross-Linguistic Evidence from Demonstratives

The authors use demonstratives such as "this/that" and "这/那" as probes to construct a bilingual English-Chinese dataset (80 items/language × 4 cues × 4 perspectives × 5 scenarios). By establishing a human baseline from 6,400 responses from 320 native speakers, the study finds that English speakers excel at proximal–distal differentiation but are weaker in other-perspective taking, while Chinese speakers show the opposite pattern. In contrast, five SOTA LLMs failed to consistently distinguish between proximal and distal categories and exhibited no cross-cultural variation, generally reverting to English-centric reasoning or "All of the above" safety fallbacks.

Browse all 63 Interpretability papers →


📦 Model Compression (59)

A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification

This paper treats the \(\text{token} \times \text{layer}\) hidden state tensor of a production LLM as a minable resource. By utilizing a two-stage aggregation probe that "compresses tokens first, then layers," it performs safety/sentiment classification within the same forward pass. With only 35M trainable parameters, it approaches the performance of standalone guard models while eliminating an extra LLM call.

A Layer-wise Analysis of Supervised Fine-Tuning

This work conducts a layer-wise analysis of SFT in 1B-32B models through information-theoretic, geometric, and optimization perspectives. It finds that instruction-following capabilities are concentrated in the middle layers (20%-80%) rather than being uniformly distributed. Based on this, a Mid-Block Efficient Tuning strategy is proposed to selectively update middle layers, achieving up to a 10.2% improvement on GSM8K over standard LoRA.
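
Selecting the middle block of layers by relative depth can be sketched in a few lines. The helper name and the half-open boundary convention `[0.2, 0.8)` are assumptions; the summary only states that the relevant layers sit in the 20%-80% depth range.

```python
def middle_layer_indices(num_layers, lo=0.2, hi=0.8):
    """Indices of layers whose relative depth i / num_layers falls in [lo, hi)."""
    return [i for i in range(num_layers) if lo <= i / num_layers < hi]

# For a hypothetical 10-layer model, adapters would attach to layers 2..7 only.
trainable = middle_layer_indices(10)
frozen = [i for i in range(10) if i not in trainable]
```

In a LoRA setup, a list like `trainable` would typically be used to restrict which transformer blocks receive adapters, leaving the first and last blocks untouched.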

Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference

This paper proposes ASL (Adaptive Selection Layer), which adaptively determines the layer location for KV cache pruning by monitoring the variance of token attention score rankings. It significantly outperforms fixed-layer selection methods on difficult tasks while remaining training-free.

Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines

This paper reinterprets LLM alignment tuning as a dynamic data pipeline design problem: what the model ultimately learns depends not only on optimization algorithms like PPO, DPO, or GRPO, but also on how candidate responses are generated, how preferences are evaluated, and how preference signals are instantiated as training objectives.

Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis

This paper proposes an analytical post-training framework that rapidly restructures dense FFNs into sparse MoEs through neuron activation pattern analysis. By distinguishing high-frequency shared experts from low-frequency routed experts and constructing routers derived from activation statistics, the method achieves a 1.17× speedup with fine-tuning on only 2k samples.

ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs

ArcLight is a lightweight LLM inference framework written from scratch (approximately 10 C++ files) designed for many-core CPUs with multiple NUMA nodes. By utilizing NUMA-local memory pools, multi-view thread pools, cross-NUMA tensor parallelism, and asynchronous subgraph synchronization, it breaks the "remote memory wall." On a 192-core ARM Kunpeng platform, it improves the decode throughput of Qwen3-4B Q4_0 by up to 46% compared to llama.cpp.

BaseCal: Unsupervised Confidence Calibration via Base Model Signals

Observing that base LLMs remain well-calibrated on free-form QA while post-trained LLMs (PoLLMs) are severely overconfident, BaseCal proposes two unsupervised schemes—feeding PoLLM's answers into the base LLM to use token probabilities as confidence (BaseCal-ReEval), or using a linear projection layer to map PoLLM's final hidden states back to the base LLM space and passing them through the base output layer (BaseCal-Proj). This achieves an average 42.9% relative reduction in ECE compared to the best unsupervised baseline across 5 datasets \(\times\) 3 model families.
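
Expected Calibration Error (ECE), the metric quoted above, is conventionally computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence per bin. The sketch below uses 10 equal-width bins, which is a common default but an assumption about this paper's exact setup. The sample values are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(accuracy - avg_conf)
    return ece

# Hypothetical overconfident model: 90% stated confidence, 50% actual accuracy.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
# Hypothetical calibrated model: 50% stated confidence, 50% actual accuracy.
calibrated = expected_calibration_error([0.5, 0.5], [True, False])
```

A relative reduction such as the 42.9% reported is then `(ece_baseline - ece_method) / ece_baseline`.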

Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference

CSD proposes a training-free enhancement framework for speculative decoding. It utilizes Online Correction Memory (OCM) to record high-frequency rejection patterns for rescuing candidates, and employs Semantic Consistency Gating (SCG) to verify candidate reliability based on probability ratios. This approach improves speculative decoding throughput by up to 2.33× while simultaneously increasing accuracy on HumanEval and MATH500.

CBRS: Cognitive Blood Request System with Bilingual Dataset and Dual-Layer Filtering

CBRS proposes a multi-platform framework that efficiently detects and parses blood donation requests from social media streams via a dual-layer filtering architecture (lightweight classifier + LLM). It constructs the first dataset containing 11K Bengali-English-Transliterated Bengali blood donation requests, where a LoRA-fine-tuned Llama-3.2-3B achieves a 92% zero-shot accuracy in parsing tasks.

Cognitive-Uncertainty Guided Knowledge Distillation for Accurate Classification of Student Misconceptions

This paper proposes a two-stage knowledge distillation framework using "Dual-level Marginal Sample Selection" based on teacher cognitive uncertainty and a difficulty-adaptive loss. Utilizing only 10.30% of real samples for incremental training, the 4B student model achieves MAP@3 = 0.9585 (+17.8% Gain) on MAP-Charting. On a benchmark of 220 middle school algebra misconceptions, it reaches 84.38% accuracy, surpassing GPT-5 (67.73%) and the directly fine-tuned 72B teacher (81.25%), while being 23× faster during inference.
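
For reference, MAP@3 is usually computed as the mean reciprocal rank of the correct label within the top three predictions when each sample has exactly one correct label. That single-label setting is an assumption here, and the labels below are invented.

```python
def map_at_3(predictions, labels):
    """Mean average precision at 3, assuming exactly one correct label per sample."""
    total = 0.0
    for ranked, truth in zip(predictions, labels):
        for rank, guess in enumerate(ranked[:3], start=1):
            if guess == truth:
                total += 1.0 / rank  # credit 1, 1/2, or 1/3 depending on position
                break
    return total / len(labels)

# Hypothetical ranked misconception predictions for three student answers.
preds = [["a", "b", "c"], ["b", "a", "c"], ["c", "d", "e"]]
truth = ["a", "a", "z"]
score = map_at_3(preds, truth)
```

Here the first sample scores 1, the second 1/2, and the third 0, for a mean of 0.5.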

Browse all 59 Model Compression papers →


🕸️ Graph Learning (24)

AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning

AgentGL is proposed as the first reinforcement learning-based Agentic Graph Learning (AGL) framework. It enables LLM agents to autonomously navigate Text-Attributed Graphs (TAGs) using graph-native search tools, achieving absolute accuracy improvements of up to 17.5% in node classification and 28.4% in link prediction.

ARK: Answer-Centric Retriever Tuning via KG-augmented Curriculum Learning

The ARK framework is proposed, which filters positive samples through a three-dimensional answer sufficiency score (Forward + Backward + Retriever alignment) and utilizes LLM-constructed Knowledge Graphs (KG) to generate hard negative samples of progressive difficulty for curriculum contrastive learning. It achieves an average F1 improvement of 14.5% across 10 datasets.

Autonomous Knowledge Graph Exploration with Adaptive Breadth-Depth Retrieval

This paper proposes ARK: a training-free Knowledge Graph (KG) retrieval agent that exposes only two minimal tools—"global lexical search" and "single-hop neighbor expansion"—allowing the LLM to autonomously switch between breadth and depth without seed nodes or fixed hop counts. It pushes the average Hit@1 on three STaRK graphs to 59.1%, achieving up to a 31.4% improvement over training-free baselines, and enables label-free strategy distillation into Qwen3-8B.
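
The two-tool interface described above can be pictured with a toy in-memory graph. Everything in this sketch is hypothetical: the graph contents, the function names `lexical_search` and `expand`, and the word-overlap scoring are stand-ins, and the LLM that would decide which tool to call next is omitted.

```python
# Toy text-attributed graph (invented for illustration).
GRAPH = {
    "aspirin": {"text": "aspirin pain relief drug", "neighbors": ["headache", "ibuprofen"]},
    "ibuprofen": {"text": "ibuprofen anti-inflammatory pain drug", "neighbors": ["aspirin"]},
    "headache": {"text": "headache symptom", "neighbors": ["aspirin"]},
}

def lexical_search(query, top_k=2):
    """Global search (breadth): rank nodes by how many query words occur in their text."""
    words = set(query.lower().split())
    scored = [(len(words & set(node["text"].split())), name) for name, node in GRAPH.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for overlap, name in ranked if overlap > 0][:top_k]

def expand(node):
    """Single-hop neighbour expansion (depth): return nodes adjacent to `node`."""
    return list(GRAPH[node]["neighbors"])

hits = lexical_search("pain relief")
neighbours = expand(hits[0])
```

An agent loop would alternate between these calls, using `lexical_search` to jump anywhere in the graph and `expand` to follow edges from a promising node, which is why no seed nodes or fixed hop counts are needed.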

AutoPKG: An Automated Framework for Dynamic E-commerce Product-Attribute Knowledge Graph Construction

AutoPKG is proposed as a multi-agent LLM framework for automatically constructing a Product-Attribute Knowledge Graph (PKG) from multimodal e-commerce content. Using a Type Induction Agent, Attribute Key Discovery Agent, Attribute Value Extraction Agent, and a centralized KGD decision agent, it enables continuous evolution and normalization of a dynamic ontology. It achieves 0.953 WKE (Type) and 0.724 WKE (Key) on the Lazada dataset, with a 7.89% recommendation GMV gain in online A/B testing.

CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs

CoG is a training-free KGQA framework that applies Kahneman's Dual-Process Theory to KG reasoning: System 1 distills SPARQL from the training set offline into a "Relational Blueprint" template library, which serves as a soft structural constraint online to guide the reranking and pruning of candidate relations; System 2 triggers evidence-conditioned reflection and targeted backtracking when search stalls, correcting early erroneous decisions. It achieves SOTA accuracy on three multi-hop KGQA benchmarks (GPT-4 backbone: CWQ 77.8, WebQSP 89.7, GrailQA 86.4) while maintaining lower costs (CWQ requires 13% fewer tokens and 12% fewer calls than PoG).

Collaboration of Fusion and Independence: Hypercomplex-driven Robust Multi-Modal Knowledge Graph Completion

M-Hyper encodes multi-modal knowledge graph entities into four orthogonal bases of a biquaternion, carrying three independent modalities (Structure/Visual/Textual) and one fused modality respectively. Through the Hamilton product, it simultaneously achieves "modal independence preservation" and "pairwise sufficient interaction," outperforming 18 baselines on DB15K, MKG-W, and MKG-Y datasets with minimal memory usage and training time.
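
The Hamilton product that mixes the four components is standard quaternion multiplication. The sketch below shows only that operation, with components allowed to be Python complex numbers so that it also covers biquaternions. How M-Hyper assigns modalities to components and trains the embeddings is not shown, and the sample values are invented.

```python
def hamilton_product(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z) tuples.

    Components may be complex numbers, which yields a biquaternion product.
    """
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )

i = (0, 1, 0, 0)
j = (0, 0, 1, 0)
ij = hamilton_product(i, j)  # equals k
ji = hamilton_product(j, i)  # equals -k, so the product is non-commutative

# A biquaternion example with complex components (arbitrary values).
bq = hamilton_product((1 + 1j, 0, 0, 0), (0, 2j, 0, 0))
```

Each output component combines all four input components of both operands, which is what lets every pair of bases interact in a single product.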

Comparing Human and Large Language Model Interpretation of Implicit Information

This paper proposes the Implicit Information Extraction (IIE) task and an LLM-based three-stage extraction pipeline (Information Extraction → Reasoning Verification → Temporal Analysis). It constructs structured knowledge graphs to represent the implicit meanings of text. Through comparisons with crowdsourced human judgments, it finds that LLMs are more conservative than humans in socially rich contexts, while humans are more conservative in short factual contexts.

ComplianceNLP: Knowledge-Graph-Augmented RAG for Multi-Framework Regulatory Gap Detection

ComplianceNLP is an end-to-end financial regulatory compliance system that constructs a knowledge graph from 12,847 SEC / MiFID II / Basel III regulations to enhance RAG retrieval. Combined with LEGAL-BERT-based multi-task obligation extraction and threshold-scored gap analysis, it outperforms GPT-4o+RAG by 3.5 points on RegObligation / GapBench with an 87.7 F1. It achieves \(2.8\times\) inference acceleration via domain-specific knowledge distillation + Medusa speculative decoding. Over four months of parallel operation, it processed 9,847 updates, reaching a 96.0% recall rate and a 3.1× increase in analyst efficiency.

CRAFTQA: A Code-Driven Adaptive Framework for Complex Structured Data Reasoning

CRAFTQA uses CodeSTEP to generate executable step-by-step Python reasoning code. When predefined operations are insufficient, CRAFT dynamically generates custom functions, significantly enhancing complex structured data QA capabilities across tables, knowledge graphs (KGs), and temporal knowledge graphs (TKGs). The GPT-4o version achieves 76.6% on the complex reasoning Overall metric.

EA-Agent: A Structured Multi-Step Reasoning Agent for Entity Alignment

This paper proposes EA-Agent, which decomposes Entity Alignment (EA) into a structured multi-step reasoning process. By planning and executing a tool pool (triplet selector + alignment tool + reflector), it achieves interpretable alignment decisions. Combined with reward-guided offline policy optimization to continuously improve planning capabilities, it achieves a Hits@1 improvement of up to 3.17% on DBP15K while mitigating efficiency issues caused by redundant triplets.

Browse all 24 Graph Learning papers →


📈 Time Series (8)

A Unified Framework for Modeling Heterogeneous Financial Data via Dual-Granularity Prompting

The FinLangNet framework is proposed, utilizing a dual-module architecture (DeepFM for static features and a Transformer with a dual-granularity prompting mechanism for temporal behavior) to achieve multi-scale credit risk prediction. Its deployment on the Didi Finance platform resulted in a 6.3pp increase in KS and a 9.9% reduction in the bad debt rate.

ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning

ODTQA-FoRe introduces an open-domain tabular question answering task focused on future numerical forecasting and post-forecast reasoning. It provides the TimeFore three-agent framework, which chains table retrieval, SQL data acquisition, specialized time-series forecasting, and answer normalization into an evaluable baseline.

STK-Adapter: Incorporating Evolving Graph and Event Chain for Temporal Knowledge Graph Extrapolation

This paper proposes STK-Adapter, which embeds three MoE modules in each layer of a Large Language Model (LLM)—ST-MoE for capturing spatio-temporal structures, EA-MoE for modeling event chain semantics, and CMA-MoE for deep cross-modal alignment. It addresses the issues of spatio-temporal information loss and layer-wise dilution caused by shallow alignment between TKG embeddings and LLMs, significantly outperforming SOTA on four benchmark datasets.

STReasoner: Empowering LLMs for Spatio-Temporal Reasoning in Time Series via Spatial-Aware Reinforcement Learning

STReasoner utilizes Network SDEs to synthesize spatio-temporal time series data with graph structures and textual semantics. By integrating a time-series encoder, a three-stage training pipeline, and a spatial-aware S-GRPO, the model learns to perform explicit reasoning based on temporal dynamics and spatial dependencies.

Temporal Leakage in Search-Engine Date-Filtered Web Retrieval: A Retrospective Forecasting Case Study

This paper systematically audits the date filters of Google and DuckDuckGo, finding that search engine date filtering fails significantly in retrospective forecasting (RF) evaluations—\(71\%\) (Google) and \(81\%\) (DuckDuckGo) of questions contain at least one page with major post-cutoff information leakage, causing prediction Brier scores to artificially drop from \(0.24\) to \(0.10\).
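
The Brier score referenced above is the mean squared difference between forecast probabilities and binary outcomes, so lower values indicate better forecasts. The numbers in this sketch are invented to show why leaked post-cutoff information makes a forecaster look better than it is. They are not the paper's data.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(probabilities)

outcomes = [1, 0]
# Hypothetical forecasts made without knowledge of the outcome.
clean = brier_score([0.5, 0.5], outcomes)
# Hypothetical forecasts after retrieval leaked pages that reveal the outcome.
leaked = brier_score([0.9, 0.1], outcomes)
```

In this toy case the score falls from 0.25 to about 0.01 solely because the forecasts moved toward outcomes that were already known.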

Test of Time: Rethinking Temporal Signal of Benchmark Contamination

This paper demonstrates that "performance decay after cutoff" is not robust evidence of benchmark contamination: as long as the same set of source documents is converted from original fill-in-the-blank questions to LLM-rephrased questions, the temporal decay signal changes significantly or even disappears.

Time-RA: Towards Time Series Reasoning for Anomaly Diagnosis with LLM Feedback

Defining the new Time-RA task, this work upgrades time series anomaly detection from binary classification to generative reasoning diagnosis (detection + classification + root cause explanation). It constructs RATs40K, the first multimodal benchmark comprising ~40,000 samples across 10 domains and 20 anomaly types, validating the feasibility of this paradigm through an AI feedback labeling pipeline and LLM fine-tuning.

TSAQA: Time Series Analysis Question And Answering Benchmark

TSAQA is a unified time series question answering benchmark: it casts 6 types of temporal analysis tasks (anomaly detection, classification, representation, comparison, data transformation, and temporal relations) into 3 closed-form question types (true/false TF, multiple-choice MC, and the newly proposed puzzling PZ). Across 13 domains with 210k samples, LLMs and time series foundation models are evaluated under a unified zero-shot protocol—results indicate that even the strongest commercial model, Gemini-2.5-Flash, achieves an average accuracy of only 65.08%, leaving significant room for improvement.


🩺 Medical LLM (47)

"Excuse Me, May I Say Something…" CoLabScience: A Proactive AI Assistant for Biomedical Discovery

CoLabScience utilizes the PULI (Positive-Unlabeled Learning for Intervention) framework to train an LLM assistant capable of proactively deciding when and how to intervene in biomedical team discussions. It leverages GRPO and an RL coordinator to automatically identify optimal intervention timings and generate scientific suggestions from streaming dialogues.

Anonpsy: A Graph-Based Framework for Structure-Preserving De-identification of Psychiatric Narratives

Anonpsy is proposed to redefine the de-identification of psychiatric narratives as a graph-guided semantic rewriting problem—narratives are first converted into semantic graphs, then constrained perturbations are performed on the graph to modify identity information while preserving clinical structure, followed by narrative reconstruction through graph-conditional generation.

Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering

This paper proposes StsPatient, which simulates standardized patients across various cognitive impairment domains and severity levels by extracting domain-specific Steering Vectors from contrastive instruction/response pairs. Combined with a Stochastic Token Modulation (STM) mechanism to control injection probability, it achieves an average improvement of 11.23% in clinical authenticity compared to prompt engineering methods and exceeds the best baseline by 18.54% in severity controllability.

Beyond the Individual: Virtualizing Multi-Disciplinary Reasoning for Clinical Intake via Collaborative Agents

The proposed Aegle framework virtualizes Multi-Disciplinary Teams (MDT) through a graph-structured multi-agent architecture. By introducing decoupled parallel reasoning and dynamic topology into the clinical intake process, it outperforms SOTA models on 53 metrics across 24 clinical departments.

Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

The authors propose MedCheck—the first evaluation framework for the lifecycle of medical LLM benchmarks, decomposing benchmark construction into 5 stages with a total of 46 criteria. Auditing 56 medical benchmarks using this framework reveals three systemic issues: (1) 50% do not align with any medical standards (ICD/SNOMED), (2) 88% do not handle data contamination, and (3) 89% do not test model robustness while 91% do not test uncertainty—concluding that current "leaderboard progress" is largely an illusion.

BioHiCL: Hierarchical Multi-Label Contrastive Learning for Biomedical Retrieval with MeSH Labels

BioHiCL utilizes hierarchical multi-label annotations of MeSH (Medical Subject Headings) to provide structured supervision for dense retrievers. By aligning the embedding space with the MeSH semantic space through depth-weighted label similarity, a 0.1B model outperforms most specialized models on biomedical retrieval, sentence similarity, and question-answering tasks.

Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA

This paper investigates how social identity markers (sexual orientation and religious beliefs) distort the accuracy and confidence calibration of LLMs in medical QA. It finds that "homosexual" markers consistently lead to performance degradation and calibration crises across 9 LLMs, and that intersectional identities produce non-additive, specific harm.

Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?

This paper constructs a high-quality German medical corpus, FineMed-de (7.3 million documents / 5.1 billion tokens filtered from FineWeb2), and performs continual pre-training and SLERP model merging on three LLMs (7B-24B) to create the DeFineMed model family. It demonstrates that domain-specialized 7B models can substantially narrow the performance gap with general 24B models on German medical tasks (win rate improved by ~3.5x).

CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation

The authors decompose the ambiguous question of "quality of a CT report" into a QA checklist of "whether each fine-grained attribute of every finding matches," constructing the CT-FineBench benchmark with 44k questions. Its sensitivity to clinical errors and correlation with human expert scores significantly outperform existing metrics such as BLEU, BERTScore, RadGraph, RaTEScore, and GREEN.

CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers

The authors remodel 3D CT interpretation as an agentic task where "radiologists iteratively explore via tools." By exposing four categories of tools—Data Ingestion, Global Navigation, Detailed Observation, and Advanced Analysis—through the Model Context Protocol (MCP), they construct CT-FlowBench with 2000+300 executable trajectories. They subsequently perform SFT to develop CT-Flow-8B, which achieves 69.46% ACC on 3D-RAD (a +22.46% improvement over slice-only baselines) with a tool name error rate of only 0.007/case.

Browse all 47 Medical LLM papers →


🧬 Computational Biology (5)

AROMA: Augmented Reasoning Over a Multimodal Architecture for Virtual Cell Genetic Perturbation Modeling

This paper proposes AROMA, a framework that integrates text evidence, knowledge graph topology, and protein sequence features in a multimodal architecture. Combined with a two-stage training strategy (SFT + GRPO), it achieves interpretable and precise genetic perturbation effect prediction.

BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models

BioTool constructs an instruction fine-tuning dataset consisting of 7,040 human-verified "query–API call" pairs covering 34 commonly used tools from three major biomedical databases: NCBI, Ensembl, and UniProt. After fine-tuning 4B-scale open-source LLMs with this data, the tool-calling quality exceeds commercial models such as GPT-5.1, Gemini-3 Pro, and Claude-4.5-Sonnet by over 15%.

ChemAmp: Amplified Chemistry Tools via Composable Agents

This paper proposes the "Tool Amplification" paradigm (distinct from traditional tool orchestration). Through the ChemAmp framework, chemistry-specific tools (UniMol2, Chemformer, etc.) are treated as composable building blocks to dynamically construct task-specific super-agents. It outperforms specialized models and general LLMs on four core chemistry tasks, including molecular design and reaction prediction, while reducing inference token costs by 94%.

ProtoCycle: Reflective Tool-Augmented Planning for Text-Guided Protein Design

ProtoCycle proposes a reflective agent framework that uses an LLM as a planner combined with a lightweight tool environment for text-guided protein sequence design. It replaces one-shot text-to-sequence generation with a multi-round "plan-tool-evaluate-reflect" cycle. On Mol-Instructions, it improves ProTrek to 14.681 and Retrieval to 0.936, achieving language alignment that nears or exceeds specialized protein design models using only ~2,000 SFT trajectories and online RL.

ToxReason: A Benchmark for Mechanistic Chemical Toxicity Reasoning via Adverse Outcome Pathway

This paper introduces ToxReason, a benchmark for mechanistic chemical toxicity reasoning based on the Adverse Outcome Pathway (AOP) framework. It integrates drug-target experimental data with toxicity labels, requiring models to reason from Molecular Initiating Events (MIE) to organ-level Adverse Outcomes (AO). A 4B model trained with GRPO reinforcement learning outperforms large models like GPT-5 in both toxicity prediction (F1 71.4%) and reasoning quality.


👥 Social Computing (45)

Among Us: Language of Conspiracy Theorists on Mainstream Reddit

Analyzing 10 years of longitudinal data from 510 million Reddit comments, the study finds that users active in conspiracy communities exhibit detectable unique linguistic patterns even in mainstream communities (average 87% classification accuracy). However, these patterns are highly dependent on community context, with community-specific models outperforming global models by up to 17 percentage points.

Bayesian Social Deduction with Graph-Informed Language Models

This paper proposes GRAIL (Graph Reasoning Agent Informed through Language), a hybrid reasoning framework that externalizes probabilistic reasoning to a factor graph model while utilizing LLMs for language understanding and interaction. GRAIL defeated human players for the first time in the social deduction game Avalon (67% win rate), with resource consumption significantly lower than large-scale reasoning models.

Beyond the Crowd: LLM-Augmented Community Notes for Governing Health Misinformation

The authors perform an empirical analysis of 30.8K health-related Community Notes from X, revealing systematic slow-response issues: a median delay of 17.6 hours for the first helpful verdict and 87.9% of notes remaining unrated. They propose the CrowdNotes+ framework, utilizing (1) Evidence Augmentation and (2) Utility-Guided Automation modes for LLM-generated notes, paired with a "Relevance → Correctness → Helpfulness" three-stage evaluation. 15 LLMs on the new HealthNotes benchmark significantly outperform the 73.19% helpfulness of human notes (with the o3 model reaching 81.15%).

BITS Pilani at SemEval-2026 Task 9: Structured Supervised Fine-Tuning with DPO Refinement for Polarization Detection

This paper proposes a two-stage pipeline consisting of "structured slot-filling SFT + DPO preference optimization" for the SemEval-2026 POLAR polarization detection task (English subset). The Qwen2.5-7B system submitted during the competition achieved a Macro-F1 of 0.7664. Post-competition, replacing the base model with Mistral-Nemo-12B and using preference pairs filtered by an LLM-judge improved the Macro-F1 to 0.8162, surpassing the organiser baseline (0.7802).

Building Arabic NLP from the Ground Up: Twenty Years of Lessons, Failures, and Open Problems

This is a reflective paper rather than an experimental one. The authors review twenty years of Arabic NLP construction, pointing out that the most difficult problems in low-resource languages are often not linguistics or model technology, but community, institutions, deployment governance, and modes of knowledge production.

ClaimDB: A Fact Verification Benchmark over Large Structured Data

ClaimDB is the first fact-verification benchmark to scale evidence to 80 real-world databases, averaging 11 tables / 4.6 million rows / 110 million tokens per claim. This forces methods to utilize executable programs (SQL) for compositional reasoning. Evaluations of tool-calling agents across 30 SOTA LLMs reveal that over half have an accuracy below 55%. Closed models rarely "abstain," while open-source models over-abstain, which points to the handling of not-enough-information (NEI) claims as the primary weakness.

Confident, Calibrated, or Complicit: Safety Alignment and Ideological Bias in LLM Hate Speech Detection

The authors evaluated 5 LLMs (strongly aligned vs. weakly aligned) under 4 political personas on the Latent Hatred benchmark using zero-shot classification. They found that strongly aligned models achieved higher strict accuracy (69.0%) compared to weakly aligned ones (64.1%) and were nearly immune to persona manipulation. However, all models exhibited systematic failures in handling irony, target group fairness, and confidence calibration.

Content Fuzzing for Escaping Information Cocoons on Social Media

This paper proposes ContentFuzz, a confidence-guided fuzzing framework from the content creator's perspective. It uses LLMs to rewrite posts to flip machine-inferred stance labels while keeping human-interpreted meaning unchanged, thereby breaking social media information cocoons.

Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine

The authors conducted an annotation study of 1,000 instances with 5 clinical experts on authentic claims from RedHOT (Reddit Health Discussions). They found that end-to-end medical fact-checking lacks construct validity due to three insurmountable barriers: difficulties in linking evidence, underspecified claims, and subjective severity judgments. Consequently, the paper proposes reframing medical fact-checking as an "interactive clinician-patient communication model" rather than a "classification-then-verdict" pipeline.

DIA-HARM: Dialectal Disparities in Harmful Content Detection Across 50 English Dialects

This paper constructs DIA-HARM, the first benchmark to evaluate the robustness of misinformation detection across 50 English dialects. It reveals that human-written dialectal content leads to a performance drop of 1.4-3.6% F1, while fine-tuned Transformers significantly outperform zero-shot LLMs (96.6% vs 78.3%). Furthermore, some models exhibit catastrophic degradation exceeding 33% on mixed content.

Browse all 45 Social Computing papers →


🛡️ AI Safety (5)

OmniCompliance-100K: A Multi-Domain Rule-Grounded Real-World Safety Compliance Dataset

This paper constructs OmniCompliance-100K, the first large-scale, multi-domain safety compliance dataset grounded in real-world cases. It contains 12,985 human-curated regulatory/policy rules and 106,009 real-world compliance cases collected via web search agents, covering nine domains such as AI safety, data privacy, finance, and healthcare. Extensive benchmarking reveals systemic shortcomings in the safety compliance capabilities of current LLMs.

On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference

This paper demonstrates that the commonly used "expose intermediate activations after shuffling" defense in Transformer secure inference is insecure. It proposes an attack that first aligns activations under different random permutations and then solves linear equations to extract weights. The attack recovers approximately usable model weights for Pythia-70m and GPT-2 with a query cost of approximately $1.

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

This paper proposes Reverse Constitutional AI (R-CAI), which synthesizes automated, controllable, and multi-dimensional adversarial toxic data by inverting the principles of Constitutional AI into a "Toxic Constitution." Combined with a critique-revision loop and a probability-clamped RLAIF mechanism, R-CAI effectively mitigates semantic degradation caused by reward hacking, achieving a 15% improvement in semantic coherence.

Signals Are Not States: Neuro-Symbolic Safeguards for Culturally Aware Classroom AI

The paper argues that classroom AI should not directly interpret culturally contextualized signals such as "silence, averted gaze, or code-switching" as educational judgments like "low engagement, inattention, or low ability." It proposes the NSCR neuro-symbolic framework: mapping multimodal signals into typed facts with uncertainty, provenance, and cultural scope, followed by executable reasoning and governance policies to generate evidence-based claims, while actively deferring (DEFER) when evidence is insufficient or stereotype risks are high.

UniVid: A Unified Vision-Language Model for Video Moderation

UniVid evolves video moderation systems from unmaintainable "fragmented" architectures to interpretable, reusable "end-to-end" systems by replacing 1000+ black-box classifiers with a unified policy-aware captioning VLM, achieving a 42.7% reduction in violation leakage during production deployment on the ByteDance platform.


🗂 More Areas (5)


🔄 Self-Supervised Learning (1)

LLMSurgeon: Diagnosing Data Mixture of Large Language Models

LLMSurgeon formalizes the question "what data was this LLM trained on" as Data Mixture Surgery. By using the soft confusion matrix of a proxy classifier to invert the domain distribution within generated text, it estimates pre-training data mixture proportions while only requiring access to model outputs.
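
The inversion idea above can be sketched as follows: if a proxy classifier's confusion matrix gives P(predicted domain | true domain), the distribution of predicted labels over generated text is a mixture of its rows, and the mixture weights can be recovered. This is a minimal sketch under assumptions, using a standard EM update for mixture proportions rather than the paper's own estimator; the names `estimate_mixture`, `confusion`, and `observed` are hypothetical.

```python
def estimate_mixture(confusion, observed, iters=500):
    """Estimate domain proportions from predicted-label frequencies.

    confusion[i][j]: assumed P(classifier predicts j | true domain i).
    observed[j]: fraction of generated samples the classifier labels j.
    Uses a generic EM update for mixture weights (an illustrative stand-in).
    """
    k = len(confusion)
    n_labels = len(observed)
    pi = [1.0 / k] * k  # start from a uniform guess
    for _ in range(iters):
        # Predicted-label distribution implied by the current estimate.
        implied = [
            sum(pi[i] * confusion[i][j] for i in range(k))
            for j in range(n_labels)
        ]
        # Multiplicative EM step; it keeps pi non-negative and summing to 1.
        pi = [
            pi[i] * sum(
                observed[j] * confusion[i][j] / implied[j]
                for j in range(n_labels)
                if implied[j] > 0
            )
            for i in range(k)
        ]
    return pi


if __name__ == "__main__":
    # Toy two-domain example with made-up numbers.
    confusion = [[0.9, 0.1], [0.2, 0.8]]
    observed = [0.69, 0.31]
    print(estimate_mixture(confusion, observed))
```

The toy numbers are invented for illustration: `observed` is what a true 70/30 mixture would produce under this confusion matrix, so the estimate should approach those proportions.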


📂 Others (4)

Automated Knowledge Component Generation and Interpretable Knowledge Tracing in Coding Problems

This paper utilizes LLMs to automatically generate and cluster Knowledge Components (KCs) for open-ended programming problems. It proposes KCGen-KT, which converts student mastery of each KC into soft tokens as input for Llama 3, improving both correctness prediction and student code generation performance on CodeWorkout and FalconCode.

Neural Induction of Finite-State Transducers

To be added after reading the paper in depth.

NSF-SciFy: Mining the NSF Awards Database for Scientific Claims

NSF-SciFy extracts 2.8M scientific claims and investigation proposals from NSF award abstracts, building a resource orders of magnitude larger than existing scientific claim datasets and demonstrating significant performance gains for claim and proposal extraction models.

Qayyem: A Real-time Platform for Scoring Proficiency of Arabic Essays

Qayyem is the first web platform supporting cross-prompt multi-trait automated essay scoring for Arabic. It integrates various scoring schemes ranging from feature engineering to SOTA neural models, supporting end-to-end academic writing assessment workflows.