🔬 Interpretability

🔬 ICLR2026 · 59 paper notes

A Cortically Inspired Architecture for Modular Perceptual AI

This paper proposes a cortically inspired blueprint for modular perceptual AI comprising four components: dedicated encoders, a shared cross-modal latent space, a routing controller, and a recursive predictive feedback loop. Sparse autoencoder experiments validate that this modular decomposition improves within-domain feature stability (+15.4pp Jaccard overlap).

ActivationReasoning: Logical Reasoning in Latent Activation Spaces

This paper proposes the ActivationReasoning (AR) framework, which embeds explicit logical reasoning into the latent activation space of LLMs (via SAE-extracted features) through a three-stage pipeline: discovering concept representations → detecting activated propositions → reasoning with logical rules. The framework supports multi-hop reasoning, concept composition, and safety control, achieving 95%+ accuracy on PrOntoQA with an 8B model, surpassing GPT-4o.

Auditing Cascading Risks in Multi-Agent Systems via Semantic–Geometric Co-evolution

This paper proposes SCCAL, a framework that models semantic–geometric co-evolution in multi-agent systems (MAS) by coupling semantic flow with the Ollivier–Ricci curvature (ORC) of interaction graphs. The joint prediction residual between the two modalities serves as an early warning signal for cascading risks, enabling anomaly detection several rounds before semantic violations become observable.

Behavior Learning (BL): Learning Hierarchical Optimization Structures from Data

Inspired by the utility maximization paradigm in behavioral science, this paper proposes the Behavior Learning (BL) framework, which models data as a Gibbs distribution induced by a hierarchical composition of interpretable, modular utility maximization problems (UMPs), achieving a unified balance among predictive performance, intrinsic interpretability, and parameter identifiability.

Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

This paper proposes the Truncated Polynomial Classifier (TPC), which enables dynamic safety monitoring by training a polynomial over LLM activation spaces order-by-order and evaluating via truncation at inference time. Low-order truncations (≈ linear probes) handle easy inputs quickly, while higher-order terms provide stronger protection for difficult inputs. TPC matches or outperforms MLP baselines on WildGuardMix and BeaverTails while offering built-in interpretability.
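
A minimal sketch of the truncation idea (not the authors' code; `order_terms` and the confidence threshold `tau` are illustrative assumptions): each polynomial order contributes an additive logit term, and evaluation stops as soon as the running logit is confidently signed. Low-order exits behave like a linear probe; hard inputs fall through to the full polynomial.

```python
class TruncatedPolyProbe:
    def __init__(self, order_terms):
        # order_terms[k] maps an activation vector x to the additive
        # logit contribution of polynomial order k + 1 (hypothetical API)
        self.order_terms = order_terms

    def predict(self, x, tau=2.0):
        """Accumulate orders until the running logit is confident."""
        logit, used = 0.0, 0
        for term in self.order_terms:
            logit += term(x)
            used += 1
            if abs(logit) > tau:   # easy input: stop at a cheap truncation
                break
        return logit > 0.0, used   # prediction and the order actually used
```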

Closing the Curvature Gap: Full Transformer Hessians and Their Implications for Scaling Laws

This paper presents the first explicit Hessian expressions and spectral norm upper bounds for a complete Transformer block (including LayerNorm and FFN), and establishes a theoretical framework showing that the loss landscape converges at an \(O(1/k)\) rate as data volume increases, providing a mathematical foundation for scaling laws and curvature-aware training.

Concepts' Information Bottleneck Models

This paper introduces Information Bottleneck (IB) regularization into the concept layer of Concept Bottleneck Models (CBMs), learning minimal sufficient concept representations by penalizing \(I(X;C)\) while preserving \(I(C;Y)\). The approach consistently improves both predictive performance and concept intervention reliability across six CBM variants and three benchmarks.
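
A hedged PyTorch sketch of one common way to realize such a penalty (a variational-IB reading; the Gaussian concept encoder and the weight `beta` are assumptions, not the paper's exact formulation): a KL term upper-bounds \(I(X;C)\), while concept and label supervision preserve \(I(C;Y)\).

```python
import torch
import torch.nn.functional as F

def ib_cbm_loss(mu, logvar, concept_logits, concepts, label_logits, labels,
                beta=1e-3):
    """mu, logvar: Gaussian posterior over the concept code;
    concepts: float 0/1 targets; labels: class indices."""
    # KL(q(c|x) || N(0, I)) upper-bounds I(X;C) up to a constant
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
    concept_loss = F.binary_cross_entropy_with_logits(concept_logits, concepts)
    task_loss = F.cross_entropy(label_logits, labels)  # preserves I(C;Y)
    return task_loss + concept_loss + beta * kl
```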

Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings

This paper proposes the Iso-Energy hypothesis — that concepts genuinely shared across modalities should exhibit equal average activation energy in each modality — and introduces Aligned SAE as an analytical tool to reveal the geometric structure of VLM embedding spaces, where bimodal atoms carry cross-modal alignment signals and unimodal atoms fully account for the modality gap.

Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning

This paper proposes NDM (Neighbor Distance Minimization), an unsupervised method that discovers interpretable, non-basis-aligned subspaces in neural network representation spaces by minimizing intra-subspace neighbor distances. On GPT-2, it achieves an average Gini coefficient of 0.71 (indicating highly concentrated information); on Qwen2.5-1.5B, it identifies separated subspaces for routing parametric knowledge versus in-context knowledge.

Decoupling Dynamical Richness from Representation Learning: Towards Practical Measurement

This paper proposes a computationally efficient, performance-agnostic measure of dynamical richness, \(\mathcal{D}_{LR}\), which quantifies rich/lazy training dynamics by comparing activations before and after the last layer, and demonstrates that neural collapse is a special case of this measure.

Dynamic Reflections: Probing Video Representations with Text Alignment

This paper is the first to extend the Platonic Representation Hypothesis (PRH) from static image–text to the temporal video–text domain. Through systematic evaluation of 121 visual and language models, it reveals that increasing the number of frames and captions at test time can nearly double alignment scores, and proposes a saturating scaling law with \(R^2 > 0.98\) to quantify this behavior.

Evolution of Concepts in Language Model Pre-Training

This paper is the first to apply crosscoders (cross-snapshot sparse dictionary learning) to track the emergence and evolution of features during language model pre-training. It identifies a two-phase transition from statistical learning to feature learning, and causally links micro-level feature evolution to macro-level downstream task metrics through attribution analysis.

Exploring Interpretability for Visual Prompt Tuning with Cross-layer Concepts

This paper proposes IVPT (Interpretable Visual Prompt Tuning), which associates abstract visual prompts with human-understandable semantic regions via cross-layer class-agnostic concept prototypes. IVPT is the first method to achieve interpretability for visual prompts while preserving the advantages of parameter-efficient fine-tuning, simultaneously improving explanation consistency (+8.4%) and classification accuracy on fine-grained benchmarks such as CUB-200.

ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection

This paper proposes ExPO-HM, inspired by the training pipeline of human content moderators. By combining policy-manual SFT warm-up, GRPO curriculum learning, and a Conditional Decision Entropy (CDE) reward, it is the first Explain-then-Detect system to comprehensively surpass direct detection baselines across binary classification, fine-grained classification, and reasoning quality in hateful meme detection, achieving up to 15–17% F1 improvement.

Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees

This paper introduces neural network (NN) verification into mechanistic interpretability, proposing the first circuit discovery framework with provable guarantees: input robustness guarantees circuit faithfulness over continuous input domains, patching robustness guarantees circuit consistency over continuous patching domains, and a four-level minimality hierarchy (quasi → local → subset → cardinal) is formalized. A monotonicity theory unifies all three types of guarantees.

GAVEL: Towards Rule-Based Safety through Activation Monitoring

Inspired by the Snort/YARA ruleset paradigm from cybersecurity, this paper proposes decomposing LLM internal activations into 23 fine-grained "Cognitive Elements" (CEs), which are then composed via Boolean logic into auditable safety rules. On Mistral-7B, the approach achieves an average AUC of 0.99 and FPR of 0.004 across 9 misuse categories with less than 1% inference overhead, while naturally supporting cross-lingual and cross-model transfer.
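
An illustrative pure-Python sketch of the rule-composition idea (the CE names, scores, and rule syntax here are hypothetical): thresholded cognitive-element activations are combined with Boolean operators into an auditable rule, in the spirit of Snort/YARA signatures.

```python
def fires(ce_scores, rule, thresh=0.5):
    """ce_scores: dict mapping CE name -> activation score in [0, 1].
    rule: a CE name, or nested ("and"/"or", sub_rule, ...) tuples."""
    if isinstance(rule, str):
        return ce_scores.get(rule, 0.0) > thresh
    op, *args = rule
    results = [fires(ce_scores, a, thresh) for a in args]
    return all(results) if op == "and" else any(results)

# e.g. flag activations scoring high on deception AND (bio OR cyber misuse)
rule = ("and", "deception", ("or", "bio_misuse", "cyber_misuse"))
```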

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

This paper proposes GEPA (Genetic-Pareto), a prompt optimizer that diagnoses failure modes from a small number of execution trajectories via natural language reflection and iteratively refines prompts. GEPA outperforms GRPO by an average of 6% (up to 20%) across six tasks while using only 1/35 of the sampling budget.

Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test

This work is the first to validate the grokking phenomenon in near-single-epoch pretraining of a real-scale LLM (7B MoE)—different data groups exhibit asynchronous memorization and delayed generalization. By analyzing the evolution of MoE routing pathways (from instance-specific to structured/shared), two zero-cost metrics are proposed to monitor generalization progress without requiring instruction tuning or benchmark evaluation.

Hallucination Begins Where Saliency Drops

This paper proposes LVLMs-Saliency, a gradient-aware diagnostic framework that quantifies the visual grounding strength of each output token. It identifies a key finding: hallucinations arise when the saliency of previously generated tokens toward the next token prediction drops. Building on this insight, the paper introduces a dual-mechanism inference-time framework combining SGRS (Saliency-Guided Rejection Sampling) and LocoRE (Local Coherence Reinforcement), achieving significant hallucination reduction across multiple LVLMs.

Hidden Breakthroughs in Language Model Training

This paper proposes POLCA (Projection Oriented Loss Change Allocation)—a method that decomposes per-sample loss changes along any orthogonal basis within a low-rank training subspace—to reveal numerous hidden conceptual breakthroughs from seemingly smooth training loss curves. The approach inverts the paradigm of training interpretability from "define skills first, then observe" to "decompose first, then discover skills automatically."
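
A minimal numpy sketch of the allocation step under a first-order reading (our simplification, not the paper's implementation): the per-sample loss change \(\Delta\ell \approx g^\top \Delta\theta\) is split across the columns of an orthonormal basis \(U\) of the low-rank training subspace.

```python
import numpy as np

def polca_allocation(g_sample, delta_theta, U):
    """g_sample: (d,) per-sample loss gradient; delta_theta: (d,) parameter
    update; U: (d, r) orthonormal basis. Returns (r,) per-direction
    allocations that sum to the in-subspace first-order loss change."""
    coords = U.T @ delta_theta        # the update expressed in the basis
    return (U.T @ g_sample) * coords  # elementwise share per basis direction
```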

How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Understanding

By analyzing the leading terms of training gradients, this paper derives closed-form expressions for each Transformer weight matrix during the early training phase. Each matrix decomposes into a simple combination of three basis functions (bigram, token-interchangeability, and context mapping), revealing how Transformers learn semantic associations such as "bird"↔"flew" from natural language data. The theoretical predictions align closely with the weights learned by real LLMs.

Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context

From a statistical decision theory perspective, this paper proves that Transformers can approximate the sufficient statistic of the Bayes-optimal likelihood-ratio test during in-context learning, and through mechanistic analysis reveals that models employ adaptive circuits of different depths for linear versus nonlinear tasks.

Information Shapes Koopman Representation

This paper revisits the problem of finite-dimensional Koopman operator representation learning from the perspective of the Information Bottleneck (IB) framework. The Koopman operator lifts nonlinear dynamical systems into infinite-dimensional linear evolution, yet practical applications require approximation within finite-dimensional subspaces, giving rise to a fundamental tension between compactness and expressiveness. The authors prove that (1) latent mutual information controls an upper bound on prediction error, but excessive maximization leads to mode collapse; and (2) von Neumann entropy prevents collapse and preserves effective dimensionality. Building on these results, an information-theoretic Lagrangian formulation is proposed that jointly balances three objectives—temporal coherence, predictive sufficiency, and structural consistency—and yields a tractable loss function. The method outperforms existing Koopman approaches on three categories of tasks: physics simulation, visual control, and graph-structured dynamics.

Initialization Schemes for Kolmogorov-Arnold Networks: An Empirical Study

This work presents the first systematic study of initialization strategies for spline-based KANs. It proposes variance-preserving schemes inspired by LeCun/Glorot and a tunable power-law initialization family. Large-scale experiments spanning 126K+ model instances demonstrate that power-law initialization consistently outperforms baselines on function fitting and PDE solving, while the Glorot scheme yields significant gains for larger models. NTK eigenspectrum analysis further reveals the underlying optimization dynamics.

Internal Planning in Language Models: Characterizing Horizon and Branch Awareness

This paper proposes an information-theoretic framework based on VQ-VAE to analyze internal planning behavior in language models, finding that planning horizon is task-dependent, that models implicitly retain information about unchosen correct paths, and that next-token decisions rely primarily on the most recent computations.

Layer by layer, module by module: Choose both for optimal OOD probing of ViT

This work systematically investigates the intermediate-layer behavior of pretrained ViTs through large-scale linear probing experiments. It finds that distribution shift is the primary cause of performance degradation in deeper layers, and reveals at the module level that the optimal probing point depends on the degree of shift: probing FFN activations is optimal under significant shift, while probing MHSA post-normalization outputs is optimal under mild shift.

LORE: Jointly Learning the Intrinsic Dimensionality and Relative Similarity Structure from Ordinal Data

This paper proposes LORE — the first framework to jointly learn embeddings and intrinsic dimensionality from ordinal triplet comparisons. It replaces the conventional fixed-dimension strategy with a non-convex Schatten-p quasi-norm regularizer (\(p < 1\)), solved via an iteratively reweighted nuclear norm (IRNN) algorithm with guaranteed convergence to a stationary point. Evaluated on synthetic data, LLM-simulated perceptual experiments, and three crowdsourced datasets, LORE substantially outperforms all baselines in dimensionality recovery while maintaining high triplet accuracy and semantic interpretability.

MATA: A Trainable Hierarchical Automaton System for Multi-Agent Visual Reasoning

This paper proposes MATA (Multi-Agent hierarchical Trainable Automaton), which formulates multi-agent visual reasoning as a hierarchical finite-state automaton. The top-level state transitions are learned by a trainable hyper agent (an LLM-based state controller), while each agent internally employs a rule-based sub-automaton. Collaboration and competition are realized through shared memory. MATA achieves state-of-the-art performance on multiple visual reasoning benchmarks.

Modal Logical Neural Networks for Financial AI

This paper proposes Modal Logical Neural Networks (MLNN), which integrate Kripke semantics (necessity/possibility modal operators) into neural networks, achieving auditable logical reasoning combined with deep learning performance for financial contract safety review, wash-sale compliance, and market collusion detection.

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

This paper demonstrates that narrow finetuning leaves clearly readable traces in LLM activations: even over the first few tokens of unrelated text, the activation differences between pre- and post-finetuning models encode rich semantic information about the finetuning objective. Using the proposed Activation Difference Lens (ADL) method, an interpretability agent achieves a 91% success rate in identifying finetuning objectives, more than twice the performance of black-box baselines.

NIMO: a Nonlinear Interpretable MOdel

This paper proposes NIMO, a hybrid model \(y = \sum_j x_j \beta_j (1 + g_{\mathbf{u}_j}(\mathbf{x}_{-j}))\) that preserves the global interpretability of linear regression coefficients (via mean marginal effects, MEM) while leveraging neural networks to provide instance-wise nonlinear corrections. Linear coefficients and network parameters are jointly and efficiently optimized through parameter elimination.
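
The prediction rule is simple enough to state directly; below is a hedged numpy sketch of it (the `g_list` callables stand in for the paper's per-feature correction networks).

```python
import numpy as np

def nimo_predict(X, beta, g_list):
    """X: (n, d) inputs; beta: (d,) linear coefficients; g_list[j] maps
    x_{-j} with shape (n, d-1) to an (n,) multiplicative correction."""
    n, d = X.shape
    y = np.zeros(n)
    for j in range(d):
        x_minus_j = np.delete(X, j, axis=1)  # all features except the j-th
        y += X[:, j] * beta[j] * (1.0 + g_list[j](x_minus_j))
    return y
```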

Noise Stability of Transformer Models

This paper proposes noise stability as a superior alternative to average sensitivity for measuring simplicity bias in Transformers, and designs a regularization method based on this metric that accelerates training by approximately 35% on synthetic tasks and 75% on language modeling.

One Language, Two Scripts: Probing Script-Invariance in LLM Concept Representations

Using the Serbian digraphic system (Latin/Cyrillic) as a natural controlled experiment, this paper investigates whether features learned by Sparse Autoencoders (SAEs) capture abstract semantics beyond surface-level tokenization. The study finds that identical sentences across scripts activate highly overlapping SAE features (Jaccard ≈ 0.58), that script switching induces smaller representational differences than same-script paraphrasing, and that this invariance strengthens with model scale — demonstrating that SAE features genuinely capture semantic structure beyond orthography.

PolySHAP: Extending KernelSHAP with Interaction-Informed Polynomial Regression

This paper proposes PolySHAP, which extends KernelSHAP's linear approximation to higher-order polynomial regression to capture nonlinear feature interactions, thereby improving the estimation accuracy of Shapley values. The paper further provides a theoretical proof that paired sampling is equivalent to second-order PolySHAP, offering the first rigorous explanation for the superior performance of this widely used heuristic.
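
A hedged numpy sketch of the second-order case (our illustration, not the authors' code): KernelSHAP's linear surrogate is replaced by a weighted regression that also carries pairwise interaction columns over coalition masks.

```python
import numpy as np
from itertools import combinations

def poly_features(Z):
    """Z: (m, d) binary coalition masks -> design matrix with a bias,
    main-effect columns, and pairwise interaction columns z_i * z_j."""
    m, d = Z.shape
    pairs = [Z[:, i] * Z[:, j] for i, j in combinations(range(d), 2)]
    return np.column_stack([np.ones(m), Z] + pairs)

def fit_polyshap(Z, v, w):
    """v: (m,) model values on the coalitions; w: (m,) Shapley kernel
    weights. Returns regression coefficients; attributions are read off
    the main-effect (and interaction) terms."""
    A = poly_features(Z)
    AW = A.T * w  # apply the kernel weights to each sampled coalition
    coef, *_ = np.linalg.lstsq(AW @ A, AW @ v, rcond=None)
    return coef
```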

PoSh: Using Scene Graphs to Guide LLMs-as-a-Judge for Detailed Image Descriptions

This paper proposes PoSh, an evaluation metric that extracts scene graphs \(G(d) = \langle O(d), E(d), K(d) \rangle\) from both generated and reference descriptions as structured rubrics, guiding an open-source 14B LLM (Qwen3-14B) to perform QA-based fine-grained error localization. PoSh surpasses GPT-4o-as-Judge by +0.05 Spearman ρ on the DOCENT artwork benchmark and CapArena, while remaining fully reproducible.

Provably Explaining Neural Additive Models

This paper proposes a dedicated efficient explanation algorithm for Neural Additive Models (NAMs) that generates provably cardinally-minimal explanations using only a logarithmic number of verification queries, outperforming existing general-purpose subset-minimal explanation algorithms in both speed and explanation quality.

RADAR: Reasoning-Ability and Difficulty-Aware Routing for Reasoning LLMs

This paper proposes RADAR, a framework that formulates adaptive inference for reasoning language models (RLMs) as a multi-objective optimization problem. It leverages Item Response Theory (IRT) to jointly estimate interpretable query difficulty and model configuration ability parameters, enabling lightweight and scalable query-level routing. RADAR outperforms state-of-the-art routing methods on 8 reasoning benchmarks while adding only approximately 7ms of latency.
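
A minimal sketch of the IRT ingredient (a two-parameter logistic form, assumed here for illustration; the paper's exact parameterization may differ): a configuration with ability theta solves a query with difficulty b and discrimination a with the probability below, and routing picks the cheapest configuration that clears a target probability.

```python
import math

def p_solve(theta, b, a=1.0):
    """2PL item-response curve: ability theta vs. query difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def route(query_b, configs, target=0.8):
    """configs: list of (cost, theta) pairs (hypothetical structure)."""
    for cost, theta in sorted(configs):  # try the cheapest option first
        if p_solve(theta, query_b) >= target:
            return cost, theta
    return max(configs)                  # fall back to the strongest one
```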

SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks

This paper proposes SALVE, a three-stage "Discover–Verify–Control" framework: (1) an L1-regularized sparse autoencoder (SAE) is trained to discover interpretable feature bases within a model; (2) Grad-FAM visualization is employed to verify the semantic meaning of discovered features; (3) the SAE decoder matrix guides permanent weight-space editing. The framework is validated on ResNet-18 and ViT-B/16, demonstrating precise, persistent, and low-side-effect control ranging from class suppression to cross-class feature modulation.

SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing

This paper proposes SEED-SET, a framework that formulates ethical evaluation of autonomous systems as a hierarchical Bayesian experimental design problem, jointly integrating objective metrics and subjective value judgments to efficiently generate test cases with high ethical alignment under limited evaluation budgets.

Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language

This paper proposes semantic regexes—a structured language for automatically describing LLM features—using three primitives (symbol/lexeme/field) and three modifier types (context/composition/quantification). The approach achieves accuracy on par with natural language descriptions while producing more concise and consistent feature descriptions, and enables quantitative analysis of how feature complexity evolves across layers.

Stretching Beyond the Obvious: A Gradient-Free Framework to Unveil the Hidden Landscape of Visual Invariance

This paper proposes the Stretch-and-Squeeze (SnS) algorithm, a gradient-free, model-agnostic bi-objective optimization framework that systematically probes the invariance manifold of visual systems by "stretching" representations at different processing levels while "squeezing" the activation of target units. SnS reveals hierarchical differences in invariance interpretability between standard and robust CNNs.

STRIDE: Subset-Free Functional Decomposition for XAI in Tabular Settings

STRIDE reformulates model explanation as an orthogonal functional decomposition problem in RKHS. By recursively centering kernel functions, it analytically computes orthogonal functional components \(f_S(x_S)\) without enumerating \(2^d\) subsets. The method not only produces scalar importance scores but also reveals how features synergistically or redundantly influence predictions, achieving 3× speedup over TreeSHAP with \(R^2 = 0.93\) on tabular data.

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

This paper proposes Temporal SAEs (T-SAEs), which introduce a temporal contrastive loss to encourage high-level features to maintain consistent activations across adjacent tokens. Through self-supervised training without explicit semantic supervision, T-SAEs achieve disentanglement of semantic and syntactic features, recovering smoother and more coherent semantic concepts without sacrificing reconstruction quality.
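
A hedged PyTorch sketch of the temporal term (a simplification: the paper's contrastive formulation is replaced here by a plain adjacent-token smoothness penalty on a designated semantic slice of the code). This term would be added to the usual SAE reconstruction and sparsity losses.

```python
import torch

def temporal_consistency_loss(z, n_semantic):
    """z: (batch, seq, n_latents) SAE codes. Penalize adjacent-token change
    in the first n_semantic latents only, leaving the remaining latents
    free to capture token-local (syntactic) structure."""
    sem = z[:, :, :n_semantic]
    return (sem[:, 1:] - sem[:, :-1]).pow(2).mean()
```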

The Geometry of Reasoning: Flowing Logics in Representation Space

This paper proposes a geometric framework that models the reasoning process of LLMs as "flows" (embedding trajectories) in representation space. Through controlled experiments that decouple logical structure from semantic content, it demonstrates that LLMs internalize logical invariants beyond surface form, and identifies potentially universal representation regularities across model families.

The Reasoning Trap — Logical Reasoning as a Mechanistic Pathway to Situational Awareness

A position paper proposing the RAISE (Reasoning Advancing Into Self Examination) framework, which systematically argues that three improvement pathways for logical reasoning (deductive/inductive/abductive) will inevitably endow LLMs with situational awareness. The paper constructs a five-level escalation ladder from basic self-identification to strategic deception, and demonstrates that current safety mechanisms such as RLHF and Constitutional AI are insufficient to arrest this trend.

There Was Never a Bottleneck in Concept Bottleneck Models

This paper identifies that Concept Bottleneck Models (CBMs) do not enforce a true "bottleneck" — the fact that a representation variable \(z_j\) can predict concept \(c_j\) does not imply it encodes only the information of \(c_j\). The paper proposes MCBM (Minimal Concept Bottleneck Model), which applies information bottleneck regularization to constrain each \(z_j\) to retain only the information of its corresponding concept, thereby achieving genuinely disentangled representations and reliable concept interventions.

Tokenizing Single-Channel EEG with Time-Frequency Motif Learning

This paper proposes TFM-Tokenizer, the first framework to learn a time-frequency motif vocabulary from single-channel EEG and encode the signal into discrete tokens. It consistently improves performance on tasks such as event classification and seizure detection, and can serve as a plug-and-play component to enhance existing EEG foundation models.

TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching

This paper proposes TokenSeek, a general-purpose memory optimization plugin for Transformer fine-tuning. By combining contextual attention information with gradient information for instance-level token importance estimation, TokenSeek retains only the top 10% high-value tokens for gradient updates, achieving up to 65.7% memory savings while matching or surpassing full-token fine-tuning performance.
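
A minimal PyTorch sketch of the ditching step (the importance scores are assumed given; the paper derives them from attention and gradient signals): only the top fraction of tokens keeps gradients, and the rest are detached from the backward pass.

```python
import torch

def keep_top_tokens(hidden, scores, keep_ratio=0.10):
    """hidden: (seq, d) token states; scores: (seq,) importance estimates.
    Gradients flow only through the kept tokens."""
    k = max(1, int(keep_ratio * scores.numel()))
    keep = torch.zeros_like(scores, dtype=torch.bool)
    keep[scores.topk(k).indices] = True
    return torch.where(keep.unsqueeze(-1), hidden, hidden.detach())
```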

Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer

Through controlled experiments and mechanistic analysis, this paper reveals the nature of subliminal learning: hidden preferences of teacher models are transferred to student models via a small number of "divergence tokens," with early layers playing a critical role. The phenomenon is also shown to be fragile: it can be suppressed by simple paraphrasing.

Uncovering Grounding IDs: How External Cues Shape Multimodal Binding

This paper employs mechanistic interpretability tools to reveal the internal mechanism by which external visual cues (symbols + dividing lines) improve reasoning in LVLMs. Under structured inputs, the model spontaneously produces "Grounding IDs"—latent identifiers that bind visual regions to symbolic anchors. Causal activation swap experiments (swap accuracy = 0.98) demonstrate that this binding causally drives model predictions. Furthermore, the mechanism reduces Qwen2.5-VL's CHAIRs hallucination rate from 32.4% to 27.2% on MS-COCO, and generalizes to closed-source models such as GPT-4o.

Uni-NTFM: A Unified Foundation Model for EEG Signal Representation Learning

Uni-NTFM is grounded in first-principles neuroscience. It introduces a Heterogeneous Feature Projection Module (HFPM) for decoupled time-frequency encoding, a hierarchical Topological Embedding (TE) for unifying heterogeneous electrode configurations, and an MoE Transformer for functional modularity and sparse coding. A 1.9B-parameter model is pretrained on approximately 28,000 hours of EEG data, achieving state-of-the-art performance on 9 downstream tasks under both linear probing and fine-tuning protocols.

Universal Properties of Activation Sparsity in Modern Large Language Models

This paper presents a systematic study of activation sparsity in modern LLMs (GLU architectures with SiLU/GELU activations). It proposes a universal top-p sparsification framework and a critical-sparsity metric, demonstrates that activation sparsity increases monotonically with model scale, identifies input sparsification as the most practical training-free acceleration scheme, and provides the first empirical evidence that diffusion-based LLMs also exhibit significant activation sparsity.
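
A hedged numpy sketch of top-p sparsification as we read it (thresholding details are assumptions): keep the smallest set of entries carrying a fraction p of the total activation magnitude and zero the tail.

```python
import numpy as np

def top_p_sparsify(a, p=0.9):
    """a: activation vector; returns a copy with the low-magnitude tail
    zeroed so the survivors carry at least a fraction p of total |a| mass."""
    mag = np.abs(a)
    order = np.argsort(mag)[::-1]          # largest magnitudes first
    csum = np.cumsum(mag[order])
    k = int(np.searchsorted(csum, p * csum[-1])) + 1
    out = np.zeros_like(a)
    out[order[:k]] = a[order[:k]]          # keep only the top-p mass
    return out
```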

VCWorld: A Biological World Model for Virtual Cell Simulation

This paper proposes VCWorld, a cell-level white-box simulator that integrates structured biological knowledge graphs with the iterative reasoning capabilities of large language models (LLMs) to simulate drug perturbation-induced signaling cascades in a data-efficient manner. The framework generates interpretable step-by-step predictions and explicit mechanistic hypotheses, achieving state-of-the-art performance on drug perturbation benchmarks.

When Machine Learning Gets Personal: Evaluating Prediction and Explanation

This paper proposes a unified framework to quantify the impact of model personalization on both prediction accuracy and explanation quality. It proves that these two dimensions can be decoupled (explanations may improve or degrade while predictions remain unchanged), derives finite-sample lower bounds on hypothesis testing error probabilities based on dataset statistics, and reveals that in many practical settings the benefit of personalization is statistically untestable in principle.

When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment

This paper identifies and mechanistically explains Reasoning-Induced Misalignment (RIM): enhancing reasoning capability (via CoT prompting or math fine-tuning) degrades safety guardrails, because reasoning and safety share neuronal resources, and safety-critical neuron activations undergo disproportionate shifts during reasoning training.

ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training

This paper proposes ZeroTuning, a training-free method that improves LLM performance across 15 datasets by applying head-specific scaling to the attention scores of the initial token (e.g., <BOS>), requiring only 4 lines of code modification.
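
A hedged sketch of the intervention as described (the multiplicative form and tensor layout are assumptions on our part): rescale the pre-softmax attention paid to the initial token, with one factor per head.

```python
import torch

def scale_initial_token(attn_logits, head_scales):
    """attn_logits: (batch, heads, q_len, k_len) pre-softmax scores;
    head_scales: (heads,) per-head factors for the <BOS> key column."""
    out = attn_logits.clone()
    out[:, :, :, 0] *= head_scales.view(1, -1, 1)  # scale the first key only
    return out
```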