⚡ LLM Efficiency

💬 ACL2026 · 8 paper notes

BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs: BOSCH is a training-free method for mixing sliding window attention (SWA) at the head level. It models SWA head selection as a large neighborhood search problem decomposed into three stages (layer importance probing → adaptive ratio allocation → grouped head selection), and systematically outperforms layer-level heuristics and 6 static head-level methods across 4 models and 4 ratio settings.
HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns: HumanLLM models 244 psychological patterns (100 personality traits + 144 social cognitive patterns) as interacting causal forces rather than isolated labels, constructs 11,359 multi-pattern interaction scenarios, and achieves \(r=0.90\) human alignment through dual-layer checklist evaluation. HumanLLM-8B surpasses Qwen3-32B in multi-pattern dynamics with 4x fewer parameters.
Multi-Drafter Speculative Decoding with Alignment Feedback: MetaSD is a unified framework integrating multiple heterogeneous drafters into speculative decoding, modeling drafter selection as a multi-armed bandit problem with Block Divergence (BD) reward signals to dynamically select the most aligned drafter, consistently outperforming single-drafter methods in both black-box and white-box configurations.
Native Hybrid Attention for Efficient Sequence Modeling: Native Hybrid Attention (NHA) concatenates long-term memory slots from a linear RNN with precise short-term tokens from a sliding window and processes them through a single softmax attention. This yields native intra-layer and inter-layer hybridization: attention weight is allocated dynamically between long-term and short-term context without extra fusion parameters. NHA outperforms Transformer and other hybrid baselines on recall-intensive and commonsense reasoning tasks.
Planning Beyond Text: Graph-based Reasoning for Complex Narrative Generation: PLOTTER shifts narrative planning from text representation to graph structure (event graph + character graph), diagnosing and repairing narrative flaws through multi-agent Evaluate-Plan-Revise iterative cycles on graph topology, significantly outperforming existing methods on narrativity, characterization, and dramatic tension.
SciCoQA: Quality Assurance for Scientific Paper–Code Alignment: SciCoQA is the first benchmark for detecting semantic discrepancies between scientific papers and their code implementations, containing 635 discrepancy instances (92 real + 543 synthetic). Evaluation of 22 LLMs reveals the strongest model detects only 46.7% of real discrepancies, uncovering a critical capability gap in automated scientific quality assurance.
SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration: SpecBound suppresses shallow-layer false high-confidence predictions via layer-wise temperature annealing and designs a bounded speculation algorithm to adaptively control draft depth and width, achieving up to 2.33x inference acceleration while maintaining lossless output.
Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding: Speculative Verification (SV) introduces a companion model of the same size as the drafter and uses the draft-companion distribution similarity \(S\) and the companion acceptance probability \(A\) to predict the target model's acceptance probability. It then dynamically selects the verification length that maximizes goodput, achieving an average 1.4x and up to 1.9x speedup over standard speculative decoding in large-batch inference.
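
To make the goodput trade-off in the Speculative Verification note concrete, here is a minimal sketch of choosing a verification length from predicted acceptance probabilities. It is not the paper's algorithm. The function names, the linear cost model (`base_cost + per_token_cost * k`), and the assumption that each probability is conditional on all earlier draft tokens being accepted are illustrative assumptions only.

```python
from typing import Sequence, Tuple


def expected_tokens(accept_probs: Sequence[float], k: int) -> float:
    """Expected tokens produced when the target model verifies the first k draft tokens.

    Assumption: accept_probs[i] is the probability that draft token i is accepted
    given that all earlier draft tokens were accepted.
    """
    total = 1.0   # the target model always emits one token itself
    prefix = 1.0  # probability that every draft token so far was accepted
    for p in accept_probs[:k]:
        prefix *= p
        total += prefix
    return total


def best_verification_length(
    accept_probs: Sequence[float],
    base_cost: float,
    per_token_cost: float,
) -> Tuple[int, float]:
    """Pick the k that maximizes expected tokens per unit of verification cost.

    The cost model is hypothetical: a fixed cost per target-model call plus a
    per-token cost, standing in for compute-bound large-batch inference.
    """
    best_k = 0
    best_rate = expected_tokens(accept_probs, 0) / base_cost
    for k in range(1, len(accept_probs) + 1):
        cost = base_cost + per_token_cost * k
        rate = expected_tokens(accept_probs, k) / cost
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k, best_rate


if __name__ == "__main__":
    # Two likely-accepted draft tokens followed by two unlikely ones.
    k, rate = best_verification_length([0.9, 0.9, 0.2, 0.2], base_cost=1.0, per_token_cost=0.5)
    print(k, round(rate, 3))
```

Under this toy cost model, verifying draft tokens that are unlikely to be accepted lowers throughput, so the selected length stops before the low-probability positions.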