💬 LLM (Other)

💬 ACL2026 · 62 paper notes

📌 Same area in other venues: 📷 CVPR2026 (2) · 🔬 ICLR2026 (56) · 🧪 ICML2026 (39) · 🤖 AAAI2026 (29) · 🧠 NeurIPS2025 (53) · 📹 ICCV2025 (6)

🔥 Top topics: LLM ×18 · Diffusion Models ×3 · Few-/Zero-Shot Learning ×2 · Agents ×2 · Personalized Generation ×2

A Study of LLMs' Preferences for Libraries and Programming Languages

This study presents the first systematic investigation into the preferences of 8 LLMs regarding libraries and programming languages during code generation. It reveals that LLMs exhibit a severe bias toward popular libraries like NumPy (45% unnecessary usage) and the Python language (chosen in 58% of high-performance tasks), and that natural language recommendations often diverge from actual code selection behavior.

Adam's Law: Textual Frequency Law on Large Language Models

This paper proposes the "Textual Frequency Law" (TFL), revealing that for identical semantics, utilizing higher-frequency textual expressions to prompt or fine-tune LLMs yields superior performance. It further introduces frequency distillation and curriculum training strategies to leverage this law.

AlphaContext: An Evolutionary Tree-based Psychometric Context Generator for Creativity Assessment

AlphaContext is proposed as an evolutionary tree-based psychometric context generator. Through four modules—HyperTree outline planning, MCTS sentence-by-sentence generation, MAP-Elites diversity optimization, and assessment-guided iterative refinement—it automatically generates high-quality long-text contexts for creativity assessment, outperforming baseline methods by an average of 8% across seven evaluation dimensions.

An Existence Proof for Neural Language Models That Can Explain Garden-Path Effects via Surprisal

By fine-tuning neural language models on garden-path sentences, this work demonstrates the existence of a neural LM capable of simultaneously explaining garden-path effects and natural reading times via surprisal, providing an existence proof for Surprisal Theory.

Automatic Combination of Sample Selection Strategies for Few-Shot Learning

This paper proposes the ACSESS method, which automatically identifies and combines complementary sample selection strategies through three mechanisms: forward selection, backward selection, and Datamodels. Validated across 23 strategies, 5 ICL models, 3 gradient-based few-shot learning methods, and 14 datasets (6 text, 8 image), the combined strategy consistently outperforms single strategies and ICL-specific baselines.

Big AI is Accelerating the Metacrisis: What Can We Do?

In this ACL 2026 position paper, Steven Bird argues that "Big AI"—industrialized LLM engineering driven by a few giants—is simultaneously accelerating three interconnected crises: the ecological crisis, the meaning crisis, and the language crisis. Given that ACL is the primary publisher of LLM research, it must shift from "individual compliance" to "collective action of a professional community." The author proposes seven specific reforms for ACL, including prioritizing public interest, resisting corporate capture, protecting critical NLP, and establishing an NLP policy track.

C-World: A Computer Use Agent Environment Creator

The authors formalize the "agent environment" as an Action / Task / Transition / Reward quadruple and implement it as C-World. It utilizes 5,571 real MCP tools, automated task synthesis, state controller perturbations, and dual-signal rewards for high-fidelity evaluation. Furthermore, it employs a "World Engine" to simulate tool responses without live APIs, enabling scalable training. Evaluation of 9 frontier LLMs reveals that "planning is generally strong while execution is generally weak." Fine-tuning with as few as 1,170 C-World trajectories outperforms a baseline trained on 119k samples.

Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future

The authors provide a systematic survey of AI-assisted peer review methods in the LLM era. They categorize "review generation" into four paradigms (fine-tuning, agent, RL, and generation enhancement), classify "after-review" work into rebuttal, meta-review, and paper revision, and present a four-quadrant evaluation taxonomy (human / reference-based / LLM-based / aspect-oriented). Finally, they discuss the future across six directions, including novelty, automatic evaluation, cross-domain, multimodality, and ethics.

CAST: Achieving Stable LLM-based Text Analysis for Data Analytics

The CAST framework is proposed to constrain the potential reasoning paths of LLMs through two mechanisms: Algorithmic Prompting and Thinking-before-Speaking. This significantly enhances inter-run stability for text summarization and labeling tasks without sacrificing output quality.

Characterizing the Expressivity of Local Attention in Transformers

The authors use Linear Temporal Logic (LTL) as a unified characterization tool to rigorously prove the following equivalences: global-only Transformer \(\leftrightarrow \mathrm{LTL}[\mathrm{P}]\), \(k\)-local-only \(\leftrightarrow \mathrm{LTL}[\mathrm{Y}^{\leq k}]\), and hybrid global+local \(\leftrightarrow \mathrm{LTL}[\mathrm{P}, \mathrm{Y}^{\leq k}]\). Consequently, they demonstrate that local and global expressivities are incomparable, hybrid models are strictly more powerful, and 1-local is the most expressive within the local family. Theoretical predictions are empirically validated on synthetic regular languages and WikiText-2.

Clozing the Gap: Exploring Why Language Model Surprisal Outperforms Cloze Surprisal

The paper systematically compares cloze responses and GPT-2 surprisal regarding their explanatory power for human word-by-word reading times. Through three types of probabilistic interventions, it demonstrates that the advantage of LM surprisal primarily stems from higher probability resolution, the ability to distinguish semantically similar words, and the assignment of fine-grained probabilities to low-frequency words.

Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models

This paper proposes Clustered Self-Assessment: a method that first clusters multiple sampled LLM responses into mutually exclusive semantic options, and then prompts the same LLM to assign confidence scores to original answers via multiple-choice question (MCQ) probabilities. This approach yields superior AUROC and Brier calibration performance over baselines like semantic entropy and \(P(\text{True})\) on TQA, NQ, and XSum.

Confidence Estimation for LLMs in Multi-turn Interactions

This paper presents the first systematic study of LLM confidence estimation in multi-turn dialogue scenarios. It proposes two core desiderata (per-turn calibration and monotonicity with increasing information), the corresponding InfoECE metric and Kendall’s \(\tau\) evaluation, and the Hinter-Guesser dataset construction paradigm. A novel P(SUFFICIENT) logit probe is introduced. Findings indicate that existing methods (verbalized / SC / P(TRUE)) exhibit poor calibration and monotonicity in multi-turn settings. In contrast, P(SUFFICIENT) reduces InfoECE to 5.27 on the GUESS task (vs. 79.97 for P(TRUE)) and achieves a \(\tau\) of 81.51, although the task remains far from solved.

CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models

The authors propose the CoSToM framework, which first uses causal tracing to locate key layers encoding Theory of Mind (ToM) features within LLMs (finding they reside primarily in early layers), and then performs lightweight alignment via activation steering on these layers. This significantly improves the quality of social reasoning in negotiation and persuasion dialogues—shifting the model from "knowing but not applying" to "knowing and applying."

DeCoVec: Building Decoding Space based Task Vector for Large Language Models via In-Context Learning

This paper proposes DeCoVec (Decoding Space based Task Vector), a training-free and non-invasive framework that constructs task vectors in the decoding space by calculating the differences in output logit distributions between few-shot and zero-shot prompts. These vectors are injected into the decoding process to guide generation, achieving an average accuracy improvement of up to 5.50 over standard few-shot baselines on TruthfulQA, Math-500, and AQUA-RAT.
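The logit-difference idea in this summary can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the scaling factor `alpha`, and the choice to add the vector back onto the few-shot logits are all hypothetical, and the logits are toy values over a four-token vocabulary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def task_vector(few_shot_logits, zero_shot_logits):
    """Difference between few-shot and zero-shot next-token logits."""
    return [f - z for f, z in zip(few_shot_logits, zero_shot_logits)]


def steer(base_logits, vector, alpha=1.0):
    """Add the scaled task vector to the logits used for decoding.

    `alpha` is a hypothetical strength knob, not a parameter from the paper.
    """
    return [b + alpha * v for b, v in zip(base_logits, vector)]


# Toy next-token logits for the same query with and without demonstrations.
zero_shot = [2.0, 1.0, 0.5, 0.0]
few_shot = [1.5, 2.5, 0.5, 0.0]

vec = task_vector(few_shot, zero_shot)
steered = steer(few_shot, vec, alpha=1.0)
probs = softmax(steered)
best = max(range(len(probs)), key=probs.__getitem__)
```

No model weights or hidden states are touched here; everything happens on the output distribution, which is what "training-free and non-invasive" suggests.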

Evaluating Customized vs. Generalist Transformer-based Models for Legal Contract Classification

This paper systematically compares the performance of 13 customized legal Transformer models and 9 general-purpose models on 3 English contract classification tasks. It finds that smaller models with contract-relevant pretraining, such as Legal-BERT and Contracts-BERT, generally outperform larger general-purpose models on long-tail legal labels.

EVE: A Domain-Specific LLM Framework for Earth Intelligence

This paper introduces EVE—the first open-source end-to-end LLM framework for Earth Observation / Earth Sciences led by the ESA \(\Phi\)-lab. It includes EVE-Instruct, a 24B domain-adapted model (based on Mistral Small 3.2 + 10.7B synthetic tokens via interleaved IFT/CPT fine-tuning + 10-checkpoint fusion), the first human-annotated EO evaluation benchmark with 5693 samples, and a RAG + hallucination detection pipeline, which has served 350 users in a 6-month pilot.

Expect the Unexpected? Testing the Surprisal of Salient Entities

This paper investigates the relationship between discourse-level salient entities and surprisal. Using over 70K manually annotated entity mentions and a novel minimal-pair prompting method, the study finds that while global salient entities are themselves more unexpected (higher surprisal), they systematically reduce the surprisal of surrounding content. This effect varies by genre, being strongest in texts with high topic coherence.

FastDiSS: Few-step Match Many-step Diffusion Language Model on Sequence-to-Sequence Generation

This paper identifies two bottlenecks in continuous diffusion language models during few-step sampling: self-conditioning signal mismatch and training saturation. It proposes the FastDiSS framework, which utilizes Self-Conditioning Perturbation (SCP) and Model-Aware Noise Scaling (MANS) to improve robustness, achieving 4×-400× acceleration across six benchmarks while maintaining generation quality.

From Fallback to Frontline: When Can LLMs be Superior Annotators of Human Perspectives?

This paper reformulates "perspective-taking (PT)," a subjective annotation task long considered human-exclusive, as a "statistical estimation problem of the latent group mean \(f^*(x,g)\)." Using a tripartite decomposition of bias, variance, and correlation, it proves that in low-budget, broad-group, or out-group scenarios, LLMs are not merely cheap substitutes but superior estimators compared to in-group human annotators. It further identifies a "reasoning paradox" where enabling reasoning actually degrades performance.

From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models

This paper provides the first systematic survey of Streaming Large Language Models (Streaming LLMs). It proposes a unified definition based on data flow and interaction concurrency, categorizing existing methods into a three-level progressive taxonomy: Output-streaming, Sequential-streaming, and Concurrent-streaming, covering methodologies and applications across text, audio, and video modalities.

Generative Floor Plan Design with LLMs via Reinforcement Learning with Verifiable Rewards

The authors converted 80,000 real apartment floor plans from RPLAN into JSON polygon formats and conducted two-stage training (SFT + GRPO with verifiable rewards: connectivity + total-area rewards, with overlap/parsing failures hard-zeroed) using Llama-3.3-70B-Instruct. This enables the LLM to output CAD-ready floor plans that satisfy both bubble diagram topological constraints and numerical area constraints. On the 8-room task, Compatibility decreased by 94% compared to HouseDiffusion (2.5 → 0.15).

Generative Interfaces for Language Models

This paper proposes Generative Interfaces (GenUI), which enables LLMs to move beyond single-box chat responses by generating interactive Web interfaces tailored to specific queries. Using a structured intermediate representation of "interaction flow graphs + finite state machines" and "adaptive reward-driven iterative refinement," GenUI achieves an 84% overall preference win rate against Claude 3.7's chat UI across 100 UIX prompts.

GRASS: Gradient-based Adaptive Layer-wise Importance Sampling for Memory-Efficient LLM Fine-tuning

This paper proposes the GRASS framework, which uses Mean Gradient Norm (MGN) as a task-aware and training-phase-aware metric for layer importance. It adaptively samples and updates subsets of model layers for fine-tuning, combined with a layer-wise optimizer state offloading mechanism. This approach achieves an average accuracy improvement of up to 4.38 points while reducing memory usage by up to 19.97%.
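As a rough illustration of importance-weighted layer sampling, the sketch below turns per-layer gradient norms into sampling probabilities and draws a subset of layers to update. It is an assumption-laden toy: the proportional weighting, the helper names, and the sampling-without-replacement loop are hypothetical choices, and the paper's exact sampling rule and offloading mechanism are not reproduced.

```python
import random


def mean_gradient_norm(grad_norm_history):
    """Average the recorded gradient norms for each layer."""
    return {layer: sum(vals) / len(vals) for layer, vals in grad_norm_history.items()}


def sampling_probs(mgn):
    """Normalize MGN scores into probabilities (uniform if all scores are zero)."""
    total = sum(mgn.values())
    if total == 0:
        return {layer: 1.0 / len(mgn) for layer in mgn}
    return {layer: score / total for layer, score in mgn.items()}


def sample_layers(probs, k, rng):
    """Draw k distinct layers, weighted by probability, without replacement."""
    remaining = dict(probs)
    chosen = []
    for _ in range(min(k, len(remaining))):
        layers = list(remaining)
        # Small epsilon keeps the weights valid if the leftover mass is zero.
        weights = [remaining[layer] + 1e-12 for layer in layers]
        pick = rng.choices(layers, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen


# Hypothetical gradient-norm logs from a few recent training steps.
history = {"layer0": [0.1, 0.3], "layer1": [1.0, 1.0], "layer2": [0.5, 0.7]}
mgn = mean_gradient_norm(history)
probs = sampling_probs(mgn)
picked = sample_layers(probs, k=2, rng=random.Random(0))
```

Recomputing `mgn` periodically during training is what would make such a scheme sensitive to the training phase; only the layers in `picked` would receive optimizer updates at that step.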

Hot-Start from Pixels: Low-Resolution Visual Tokens for Chinese Language Modeling

By rendering Chinese characters into \(8\times 8\) grayscale images and feeding them into a GPT-2 style decoder for next-character prediction, this work achieves final accuracy (39.21%) comparable to index-based baselines (39.10%). Crucially, it doubles the baseline accuracy in early training (at 0.4% data), demonstrating that "visual structure" serves as a natural hot-start prior for Chinese character modeling.

SteerEval: How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities

SteerEval decomposes LLM controllability into L1 (what to express), L2 (how to express), and L3 (specific words) following Marr’s three-level analytical framework. Covering 7,560 paired samples across three domains—Personality, Sentiment, and Language Features—it systematically reveals a critical gap: "existing steering methods generally collapse at a fine-grained level."

How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs

This paper identifies a "benign self-reading" pattern in reasoning LLMs (such as DeepSeek-R1) during quantitative reasoning. Answer tokens exhibit a forward drift (advancing step-by-step along the reasoning chain) and concentration on semantic anchors (repeatedly revisiting key steps) when attending to reasoning traces, a pattern strongly correlated with correctness. Based on this, a training-free activation steering method driven by SRQ (Self-Reading Quality) is proposed, improving accuracy by up to 2.6% across multiple benchmarks.

Identifying the Periodicity of Information in Natural Language

This paper adapts the AutoPeriod detection algorithm from signal processing to token-surprisal sequences, proposing APS (AutoPeriod of Surprisal) to directly detect information density cycles at a single-document level (e.g., "one cycle every 53 tokens"). It discovers that approximately 11% of human-written documents exhibit strict periodicity, and the periodicity of LLM-generated text is twice as strong as that of humans (30% vs. 14.8%), providing direct evidence for the UID theory and offering explainable features for AI text detection.
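To make "detecting a cycle in a surprisal sequence" concrete, here is a simplified sketch that uses only autocorrelation. AutoPeriod proper proposes candidate periods from a periodogram and then validates them on the autocorrelation function; this toy keeps just the second half, and the synthetic surprisal series and function names are assumptions for illustration.

```python
import math


def autocorrelation(series, lag):
    """Autocorrelation of `series` at `lag` (covariance over total variance)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    if var == 0:
        return 0.0
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var


def dominant_period(series, min_lag=2, max_lag=None):
    """Return the lag with the highest autocorrelation in [min_lag, max_lag]."""
    if max_lag is None:
        max_lag = len(series) // 2
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorrelation(series, lag))


# Synthetic per-token surprisal with one cycle every 8 tokens.
surprisal = [5.0 + 2.0 * math.sin(2 * math.pi * i / 8) for i in range(64)]
period = dominant_period(surprisal)
```

On real text, the surprisal values would come from a language model, and a significance test would be needed before calling a document periodic.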

Incentives Of EdTech: A Systematic Review Of EduNLP Research

This is the first systematic literature review of EduNLP (Educational Natural Language Processing) focusing on the ACL Anthology. The authors manually annotated 204 papers from the 2024–2025 BEA/NLP4CALL workshops and main conferences across five dimensions: tasks, motives, stakeholder inclusion, incentive structures, and ethical risks. A core tension is identified: research is driven by private sector incentives (e.g., commercial automated grading), while the actual needs of educational infrastructure—especially teachers—are systematically neglected. Teachers are treated as beneficiaries in only 33.3% of papers, real-world deployment accounts for only 9.8%, and ethical engagement often stops at "acknowledgment" rather than "action."

Iterative Formalization and Planning in Partially Observable Environments

The PDDLego+ framework is proposed to enable LLMs to iteratively generate and refine PDDL (Planning Domain Definition Language) representations in partially observable environments. Through a dual-layer error correction loop (solver error + simulation error), it achieves effective planning without the need for fine-tuning or examples.

Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion

This paper constructs discrete text diffusion using Glauber dynamics from statistical physics. By treating the pretrained UL2 model as the "energy function/noise distribution" and using mask infilling as the Markov transition kernel, the trained Glauber-UL2 matches the generation perplexity of same-sized GPT-2-M/L AR models for the first time. It outperforms MDLM in search and planning tasks like Sudoku/Zebra and surpasses AR in best-of-N results under iso-compute constraints.

LinkNav: Surfacing Interconnected Information in Scientific Articles

To be supplemented after deeper reading.

Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

By comparing the ability of self-probing (using the model's own hidden states) and external probing (using hidden states from other models) to predict correctness, this paper identifies "inter-model consensus" as a key confounding factor that masks privileged knowledge. After eliminating consensus, the study reveals domain-specific privileged knowledge: it exists in factual tasks but is absent in mathematical reasoning.

Min-k Sampling: Decoupling Truncation from Temperature Scaling via Relative Logit Dynamics

Min-k Sampling detects the "semantic cliff" (the boundary between high-confidence candidates and low-quality tail noise) by analyzing the local structure of sorted logit distributions. It achieves strict temperature-invariant truncation, maintaining robust reasoning and creative writing quality even at extreme temperatures.
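One way to see why a truncation rule based on the relative shape of sorted logits can be temperature-invariant is the largest-gap heuristic below. This is a stand-in chosen for illustration, not the paper's cliff-detection rule: dividing every logit by a temperature rescales all gaps by the same factor, so the position of the largest gap, and hence the kept set, does not change.

```python
def cliff_cutoff(logits):
    """Keep the token indices ranked above the largest gap in the sorted logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    ranked = [logits[i] for i in order]
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    if not gaps:
        return order
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    return order[: cut + 1]


# Toy logits: three plausible candidates, then a drop into tail noise.
logits = [5.0, 4.8, 4.5, 1.0, 0.9, 0.2]
kept = cliff_cutoff(logits)

# Temperature scaling divides every logit by T; the kept set is unchanged.
kept_by_temperature = {
    t: cliff_cutoff([x / t for x in logits]) for t in (0.5, 2.0, 10.0)
}
```

By contrast, top-p truncation depends on the softmax probabilities, so its cutoff shifts as temperature rises and the tail gains mass.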

Mind the Gap: How Elicitation Protocols Shape the Stated-Revealed Preference Gap in Language Models

The authors upgrade forced-choice to expanded-choice (allowing "Equal Preference" and "Depends" as neutral options) within the LitmusValues / AIRiskDilemmas framework. Systematically evaluating 24 LLMs, they find that allowing neutrality on the stated side significantly increases the SvR Spearman correlation \(\rho\) from ~0.2 to ~0.7 (by filtering out weak signals where models lack an inherent stance). Conversely, allowing neutrality on the revealed side collapses \(\rho\) toward zero or even negative values (as many models select "Depends/Equal" almost exclusively in contextualized scenarios). They also verify that system prompt steering based on stated ranking is generally unreliable for a large set of 16 values. The conclusion is that the SvR gap depends heavily on the elicitation protocol, and preference evaluation must explicitly model the state of "having no opinion."

Model-Agnostic Meta Learning for Class Imbalance Adaptation

This paper proposes HAMR (Hardness-Aware Meta-Resample), a unified meta-learning framework. It dynamically estimates instance-level weights through bi-level optimization to prioritize truly difficult samples, combined with a neighborhood-aware resampling mechanism to focus training on difficult instances and their semantic neighbors. It consistently outperforms strong baselines across 6 imbalanced NLP datasets.

MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models

Scientific ideation is explicitly modeled as a two-stage conditional reasoning task: "context → motivation → reasoning → method." After an SFT cold start, a 14B model is trained using GRPO and a pair of novel verifiable rewards: Entropy-Aware Information Gain (EAIG) and Contrastive Semantic Gain (CSG). The model outperforms GPT-4o, Claude-3.5-Sonnet, and agentic frameworks such as AI-Scientist-V2 on held-out test sets from ICLR/NeurIPS.

MulDimIF: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models

The authors propose MulDimIF, a multi-dimensional constraint framework that systematically evaluates the instruction-following capabilities of LLMs across three dimensions: constraint patterns (3 types), constraint categories (4 categories, 13 subcategories), and constraint difficulty (4 levels). Model performance is significantly improved via GRPO training, with findings indicating that improvements primarily stem from parameter updates in the attention modules.

Not All Animals Are Equal: Metaphorical Framing through Source Domains and Semantic Frames

This paper proposes ConceptFrameMet, the first computational framework combining FrameNet semantic frames and Conceptual Metaphor Theory (CMT) source domains. Using a RoBERTa-based multi-task model, it detects metaphors and predicts their semantic frames and source domains. Combined with log-likelihood ratio (LLR) statistics to identify salient metaphorical patterns in discourse, the study reveals that liberals and conservatives use the same source domains in immigration discourse but choose different semantic frames to convey drastically different associations.

Nürnberg NLP at PsyDefDetect: Multi-Axis Voter Ensembles for Psychological Defence Mechanism Classification

This BioNLP 2026 PsyDefDetect shared task system paper treats psychological defense mechanism classification as a problem with fuzzy boundaries and limited annotation consistency. It uses an ensemble of 9 voters across different granularities, training paradigms, and base models, achieving an F1 of 0.420 on the hidden test set and ranking 1st among 21 registered teams.

One Persona, Many Cues, Different Results: How Sociodemographic Cues Impact LLM Personalization

This paper systematically compares the effects of 6 common persona prompting methods (two variants each for name, explicit mention, and dialogue history) across 7 LLMs and 4 tasks. The study finds that while average responses are highly correlated across prompting methods, the differences between personas generated by different methods vary significantly. Overly explicit prompts lead to stronger personalization bias, suggesting that bias conclusions should not be drawn based on a single prompting method.

Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing

This paper proposes novelshare: it transforms tokens of copyrighted text into truncated, non-reversible hashes and publishes only the hash sequences along with the researcher's own annotations. This allows users who legally possess the original text to re-align the annotations under slight version differences, achieving a token alignment accuracy of 98.7% to 99.79% on close-edition novels.

PersonaArena: Dynamic Simulation for Evaluating and Enhancing Persona-Level Role-Playing in Large Language Models

PersonaArena utilizes user-generated content to construct 1,000 fine-grained personas and evaluates and enhances the persona-level role-playing capabilities of LLMs through dynamic social simulations and multi-judge debates.

Prefix Parsing is Just Parsing

This paper proposes prefix grammar transformation, an efficient method that reduces prefix parsing to ordinary parsing. Given a grammar, it constructs a new grammar that generates exactly the set of all prefix strings of the original grammar, allowing any existing ordinary parsing algorithm to be reused without modification.
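The reduction can be illustrated with a textbook-style construction for context-free grammars, which may differ from the paper's transformation. For each nonterminal `A` the sketch adds a primed copy `A'` that derives every prefix of `A`'s strings. It assumes every nonterminal derives at least one terminal string and that the primed names do not clash with existing symbols. The brute-force `language` helper exists only to check the result on a small example.

```python
def prefix_grammar(grammar, start):
    """Build a grammar whose start symbol derives all prefixes of L(grammar)."""
    pre = {nt: nt + "'" for nt in grammar}
    new_rules = {nt: [tuple(r) for r in rules] for nt, rules in grammar.items()}
    for nt, rules in grammar.items():
        out = set()
        for rhs in map(tuple, rules):
            for i in range(len(rhs) + 1):
                out.add(rhs[:i])  # cut exactly at a symbol boundary
                if i < len(rhs) and rhs[i] in grammar:
                    out.add(rhs[:i] + (pre[rhs[i]],))  # cut inside rhs[i]
        new_rules[pre[nt]] = sorted(out)
    return new_rules, pre[start]


def language(grammar, start, max_len):
    """All derivable terminal strings of length <= max_len (brute force)."""
    seen, results, stack = set(), set(), [(start,)]
    while stack:
        form = stack.pop()
        if form in seen:
            continue
        seen.add(form)
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:
            results.add("".join(form))
            continue
        for rhs in grammar[form[idx]]:
            new = form[:idx] + tuple(rhs) + form[idx + 1:]
            # Terminals never disappear, so over-long forms can be pruned.
            if sum(1 for s in new if s not in grammar) <= max_len:
                stack.append(new)
    return results


# Balanced parentheses: S -> ( S ) S | epsilon
balanced = {"S": [("(", "S", ")", "S"), ()]}
prefixes, prefix_start = prefix_grammar(balanced, "S")
short_prefixes = language(prefixes, prefix_start, 4)
```

Because the output is an ordinary grammar, any standard parser can be run on it unchanged, which is the point of the reduction.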

Repeated Sequences Reveal Gaps between Large Language Models and Natural Language

This paper proposes an evaluation framework based on repeated subsequence distributions, characterizing the entropy growth behavior of text through high-order Rényi entropy. It finds that natural language exhibits a stable sub-linear entropy growth pattern, while the entropy indices of GPT-generated text increase monotonically with model scale, revealing systematic differences in long-range statistical organization between LLMs and natural language.
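For reference, the standard definition of the Rényi entropy of order \(\alpha\) for a discrete distribution \(P = (p_1, p_2, \dots)\) is given below. Which distribution the paper evaluates (presumably one over repeated subsequences) and which orders it uses are not stated in this note, so the formula is background rather than the paper's exact estimator.

\[
H_\alpha(P) = \frac{1}{1-\alpha}\,\log \sum_{i} p_i^{\alpha}, \qquad \alpha \ge 0,\ \alpha \neq 1 .
\]

The limit \(\alpha \to 1\) recovers Shannon entropy, and \(\alpha = 2\) gives the collision entropy \(H_2(P) = -\log \sum_i p_i^2\), which depends only on the probability that two independent draws coincide.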

Rethinking the Idiomaticity Decomposability Hypothesis: Evidence from Distributional Learning

This paper re-examines the Idiom Decomposability Hypothesis using contextualized language models as "controlled distributional learners." It finds that model-derived decomposability is only weakly correlated with human judgments and exhibits a small but stable negative correlation with syntactic flexibility, suggesting that idiomatic behavior is better explained by distributional experience, surprisal, and representational stabilization.

SEF-CLGC at SemEval-2026 Task 11: Logical Notation Impact on Language Model Performance

This SemEval-2026 Task 11 system paper translates natural language syllogisms into various formal logic notations (FOL, CLIF, CLINGO, etc.) and performs supervised fine-tuning (SFT) on Small Language Models (SLMs) with <1B parameters (Flan-T5). It demonstrates that pairing natural language with "pre-trained" formal notations like FOL significantly reduces content bias during reasoning while maintaining extremely low computational requirements.

Solver-Independent Automated Problem Formulation via LLMs for High-Cost Simulation-Driven Design

This paper proposes APF (Automated Problem Formulation), a solver-independent framework that utilizes LLMs to transform natural language design requirements from engineers into executable mathematical optimization models. By employing an innovative data generation and test instance annotation pipeline, it overcomes the difficulty of filtering data without solver feedback in high-cost simulation scenarios, significantly outperforming existing methods in antenna design tasks.

Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning

This paper systematically compares the impact of single-source models, multi-source models, and human data as sources for supervised fine-tuning (SFT) on Llama models. It finds that multi-source synthetic data mitigates distributional collapse and self-preference, but synthetic data may weaken safety guardrails while maintaining output quality, with risks varying in complex ways based on source model scale and mixing methods.

Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context

This paper proposes the Quantile Token Regression method. By inserting specialized quantile tokens into the input sequence and combining retrieved neighbor instances with their empirical distributions, the LLM can predict a complete conditional distribution rather than a single point estimate. This approach reduces the avg MAPE by approximately 4 points and narrows prediction intervals by more than 2x compared to baselines on Airbnb and StackSample datasets.

Think in Sentences: Explicit Sentence Boundaries Enhance Language Model's Capabilities

This paper proposes inserting delimiter tokens at sentence boundaries within LLM inputs to implement a "sentence-by-sentence" reasoning paradigm via ICL and SFT. Consistent improvements are reported across models from 7B to 600B (GSM8k +7.7%, DROP +12.5%) with almost no additional computational overhead.

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

TingIS is an end-to-end risk event discovery system deployed on a FinTech platform. Through a five-module architecture (semantic distillation, cascaded routing, event linking engine, state management, and multi-dimensional denoising), it extracts actionable risk events from massive, noisy customer complaints in real-time, achieving a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents.

OOD Proxy Demonstration Retrieval Scheme for Robust In-Context Learning

By constructing dual proxies for the source and target domains and calculating their perplexity difference as an OOD score, combined with Mahalanobis distance constraints for demonstration diversity, this method accurately filters demonstrations from the source domain aligned with target distributions under conditions where target samples are inaccessible, thereby enhancing the robustness of LLM In-Context Learning.

UCS: Estimating Unseen Coverage for Improved In-Context Learning

This paper proposes UCS (Unseen Coverage Selection), a training-free subset-level coverage prior based on the Smoothed Good-Turing estimator. By estimating the number of unobserved potential clusters in the candidate exemplar set, it regularizes existing ICL exemplar selection methods, improving accuracy by 2-6% on intent classification and reasoning tasks.
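The flavor of "estimating what has not been observed yet" can be shown with two classical estimators built from frequency-of-frequency counts. These are simpler relatives of the smoothed estimator named in the summary, not the paper's method: the Good-Turing missing mass, and the unsmoothed Good-Toulmin estimate of new clusters, restricted here to an extrapolation ratio `t` of at most 1. The cluster labels are made up for the example.

```python
from collections import Counter


def frequency_of_frequencies(cluster_labels):
    """Map r -> number of clusters observed exactly r times."""
    counts = Counter(cluster_labels)
    return Counter(counts.values())


def missing_mass(cluster_labels):
    """Good-Turing estimate of the probability that the next item is a new cluster."""
    n = len(cluster_labels)
    if n == 0:
        return 0.0
    return frequency_of_frequencies(cluster_labels).get(1, 0) / n


def good_toulmin_unseen(cluster_labels, t=1.0):
    """Unsmoothed Good-Toulmin estimate of new clusters in t * n further samples."""
    if not 0 <= t <= 1:
        raise ValueError("the unsmoothed estimator is only used here for 0 <= t <= 1")
    phi = frequency_of_frequencies(cluster_labels)
    return -sum((-t) ** r * count for r, count in phi.items())


# Hypothetical cluster assignments of eight candidate exemplars.
labels = ["a", "a", "a", "b", "b", "c", "d", "e"]
mass = missing_mass(labels)
unseen_same_size = good_toulmin_unseen(labels, t=1.0)
unseen_half_size = good_toulmin_unseen(labels, t=0.5)
```

A candidate set with many singleton clusters scores high on both quantities, which is the kind of signal a coverage prior could use to prefer subsets that still have room to reveal new clusters.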

Understanding Structured Financial Data with LLMs: A Case Study on Fraud Detection

This paper proposes FinFRE-RAG, a two-stage framework that serializes high-dimensional tabular transaction data into natural language via importance-guided feature dimensionality reduction. By combining label-aware retrieval-augmented in-context learning, it significantly improves the F1/MCC of open-source LLMs in financial fraud detection, narrowing the performance gap with specialized tabular classifiers.

Unlocking the Potential of Diffusion Language Models through Template Infilling

This paper proposes Template Infilling, which transforms the generation constraints of Diffusion Language Models (DLMs) from a single prefix into structural anchors distributed throughout the output. By utilizing dynamic span allocation to reserve space for complex reasoning, the method significantly stabilizes and enhances parallel generation quality in mathematical reasoning, code generation, and global planning tasks.

VISTA: Verification In Sequential Turn-based Assessment

VISTA proposes a multi-turn dialogue factuality evaluation framework based on claim-level decomposition and sequential consistency tracking. It subdivides unverifiable content into four categories: subjective, contradicted, lacking evidence, and abstention, significantly outperforming FActScore and LLM-as-Judge baselines across four dialogue benchmarks and eight LLMs.

VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs

Voyager is a training-free LLM data generation algorithm that maintains diverse anchors and explorers using DPP, while iteratively rewriting prompts using textual gradients. It significantly improves Vendi diversity in creative writing and reasoning data generation with minimal sacrifice to quality.

Wait, There’s a Way Out: A Decision Mechanism for Dialogue Derailment Prediction

The paper decouples "belief estimation" from "trigger decision" in dialogue derailment prediction. By using forward simulation to identify recoverable tense moments, the authors achieve a significant reduction in false alarm rates (from 36.2% to 26.7%) without sacrificing overall accuracy.

When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges

This paper systematically investigates the failure modes of textual gradient methods when simultaneously optimizing prompts for multiple evaluation criteria. It identifies gradient dilution and instruction interference as two key bottlenecks that prevent multi-objective optimization from significantly improving upon initial prompts.

When TableQA Meets Noise: A Dual Denoising Framework for Complex Questions and Large Tables

By decomposing semantic units in questions and constructing evidence trees for transparent table pruning, the EnoTab framework achieves significant performance gains when processing complex questions and ultra-large tables, effectively mitigating the negative impact of noisy data on reasoning through a dual denoising mechanism.

Why Did Apple Fall: Evaluating Curiosity in Large Language Models

This work proposes the first psychologically inspired framework to systematically evaluate curiosity behaviors in LLMs. By combining questionnaire self-reports with behavioral experiments, it finds that LLMs exhibit curiosity-like behavioral patterns rather than curiosity as an intrinsic trait. Furthermore, a curiosity-driven questioning pipeline is designed, showing that simulating curious behavior can enhance downstream reasoning performance.