Hedonic Neurons: A Mechanistic Mapping of Latent Coalitions in Transformer MLPs

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=v6HPsCu2R8
Code: https://github.com/TaKneeAa/hedonicNeurons
Area: Mechanistic Interpretability / Cooperative Game Theory / LLM Internal Representations
Keywords: hedonic game, neuron coalition, PAC-stable, LoRA, MLP, synergy, mechanistic interpretability

TL;DR

By treating neurons in Transformer MLPs as "rational players" in a cooperative game, this work uses hedonic games and the PAC-Top-Cover algorithm to identify neuron coalitions whose joint ablation effect is non-additive, i.e., differs from the sum of the individual ablation effects. The resulting coalitions show how LoRA fine-tuning encodes task features within synergistic neuron groups.

Background & Motivation

Background: LoRA fine-tuning teaches LLMs new tasks with few trainable parameters, and the new features are concentrated primarily in mid-layer MLPs. However, LoRA's low-rank updates "diffuse" new feature directions across thousands of neurons, leaving the weight updates structurally opaque to manual inspection.
Limitations of Prior Work: Existing interpretability tools focus on "individual units" or "statistical proximity." Probing captures correlations between neurons and labels but ignores synergy; SAEs decompose activations into monosemantic directions but erase non-linear dependencies; clustering groups neurons by statistical similarity rather than functional interaction. None address the core question: Which neuron subsets are synergistic (joint contribution > sum of individual contributions)?
Key Challenge: Features are computed not by isolated neurons but by "partnerships" of neurons. However, each neuron can belong to $2^{n-1}$ possible coalitions, so enumerating them to find synergistic groups is computationally infeasible.
Goal: Provide a theoretically grounded and scalable framework to automatically discover stable synergistic coalitions within MLP layers and track their evolution (persistence, splitting, merging, vanishing) across depths.
Core Idea: [Game Theory Analogy] SGD exerts "selection pressure" on neurons—only directions that reduce loss survive, and many neurons are only useful when combined. Thus, neurons can be viewed as agents in a hedonic game where utility measures "how much my survival depends on synergy with others." Stable coalitions = groups of neurons that survived together during training.

Method

Overall Architecture

The method consists of two stages: first, formalizing the search for synergistic coalitions within a single layer as a cooperative game with hedonic utility, solved via the PAC-Top-Cover algorithm to find PAC-stable neuron partitions; second, using maximum weight bipartite matching to connect coalitions in adjacent layers and track how these "meta-neurons" evolve with depth.

flowchart LR
    A[LoRA Fine-tuned MLP Layer] --> B[Calculate Pairwise Value φ_ij<br/>OCA / PAS]
    B --> C[Multi-Friend Choice Set<br/>Top-k partners for each neuron]
    C --> D[PAC-Top-Cover<br/>Sampling + Preference Digraph + Sink SCC]
    D --> E[Intra-layer Stable Coalitions π_ℓ]
    E --> F[Adjacent Layer Interaction Mass<br/>Max-weight Bipartite Matching]
    F --> G[Tracking: Persistence/Splitting/Merging/Vanishing]

Key Designs

1. Pairwise Value Functions: Complementary OCA and PAS Routes. The process begins by estimating a synergy score $\phi_{ij}$ for each neuron pair $(i,j)$, where positive values indicate synergy and negative values indicate redundancy. The structural heuristic OCA (Orthogonal-Co-Activation) assumes that neurons computing complementary features have "orthogonal weights but correlated activations": $\phi_{\mathrm{OCA}}(i,j)=(1-|\cos(W_i,W_j)|)\,\rho(a_i,a_j)$, where $W_i$ is the weight vector of neuron $i$ and $\rho$ is the Pearson correlation of activations. The functional PAS (Pairwise Ablation Synergy) directly measures second-order interactions by ablating neurons, i.e., resetting them to their pre-LoRA weights: $\Delta_{ij}(x)=\ell(x)-\ell_{-i}(x)-\ell_{-j}(x)+\ell_{-(i,j)}(x)$, with $\phi_{\mathrm{PAS}}(i,j)=\mathbb{E}_x[\Delta_{ij}(x)]$, where $\ell_{-S}(x)$ denotes the loss on input $x$ with the neurons in $S$ ablated. Positive values indicate that "joint ablation changes the loss more than the sum of individual ablations." To scale to large $n$, PAS is approximated using the mixed second derivative $\partial^2\ell/\partial a_i\partial a_j$ multiplied by the activation differences. Together, the two routes test the robustness of the framework from structural and functional perspectives.
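The two pairwise scores can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: `toy_loss` is a hypothetical stand-in for the model loss with selected neurons reset to their pre-LoRA weights, and the exact (non-approximated) PAS definition is used rather than the second-derivative shortcut.

```python
import math

def cosine(u, v):
    """Cosine similarity between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(x, y):
    """Pearson correlation between two activation traces of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def phi_oca(w_i, w_j, a_i, a_j):
    """OCA score: (1 - |cos(W_i, W_j)|) * rho(a_i, a_j)."""
    return (1.0 - abs(cosine(w_i, w_j))) * pearson(a_i, a_j)

def phi_pas(loss, inputs, i, j):
    """PAS score: mean over inputs of l - l_{-i} - l_{-j} + l_{-(i,j)}.

    `loss(x, ablated)` is an assumed callable returning the loss on x
    with the neurons in the set `ablated` reset to pre-LoRA weights.
    """
    total = 0.0
    for x in inputs:
        total += (loss(x, frozenset()) - loss(x, frozenset({i}))
                  - loss(x, frozenset({j})) + loss(x, frozenset({i, j})))
    return total / len(inputs)

def toy_loss(x, ablated):
    """Hypothetical loss: neurons 0 and 1 interact, neuron 2 is purely additive."""
    loss = x * x
    if 0 in ablated:
        loss += 0.2
    if 1 in ablated:
        loss += 0.3
    if 0 in ablated and 1 in ablated:
        loss += 0.5  # extra cost that appears only under joint ablation
    if 2 in ablated:
        loss += 0.1
    return loss

pas_01 = phi_pas(toy_loss, [0.0, 1.0, 2.0], 0, 1)  # interaction term survives
pas_02 = phi_pas(toy_loss, [0.0, 1.0, 2.0], 0, 2)  # additive effects cancel
oca = phi_oca([1.0, 0.0], [0.0, 1.0], [1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In the toy loss, the individual ablation costs cancel in the four-term difference, so only the joint-ablation term contributes to `pas_01`, while `pas_02` is zero because neuron 2 has no interaction with neuron 0.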

2. Multi-Friend Choice Sets for Tractability (Top-Responsiveness). Since enumerating each neuron's preferences over all coalitions is intractable, the game is restricted to be top-responsive: each neuron only cares about its top-$k$ partners. Specifically, the choice set of neuron $i$ within coalition $S$ consists of its $k$ highest-scoring partners: $\mathrm{Ch}(i,S)=\arg\max_{T\subseteq S\setminus\{i\},|T|=k}\sum_{j\in T}\phi_{ij}$, and utility is normalized as $u_i(S)=\frac{1}{k}\sum_{j\in\mathrm{Ch}(i,S)}\phi_{ij}$. Neurons therefore compare only their most valuable sets of partners rather than all possible groupings, which captures "multi-partner synergy" where a neuron is meaningful only when several complementary features are present.
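As a small illustration of the choice set and normalized utility, the sketch below ranks a neuron's partners inside a coalition by $\phi_{ij}$ and averages the top $k$. The matrix `phi` is made-up data, and the handling of coalitions with fewer than $k$ partners (use all of them, still divide by $k$) is an assumption not stated in the summary.

```python
def choice_set(i, coalition, phi, k):
    """Top-k partners of neuron i inside `coalition`, ranked by phi[i][j]."""
    partners = sorted(j for j in coalition if j != i)
    partners.sort(key=lambda j: phi[i][j], reverse=True)  # stable sort, ties keep index order
    return partners[:k]

def mfc_utility(i, coalition, phi, k):
    """Normalized utility u_i(S) = (1/k) * sum of phi over the choice set.

    Assumption: when fewer than k partners are available, all of them are
    used and the sum is still divided by k.
    """
    return sum(phi[i][j] for j in choice_set(i, coalition, phi, k)) / k

# Hypothetical symmetric synergy matrix for four neurons.
phi = [
    [0.0, 0.9, 0.5, -0.2],
    [0.9, 0.0, 0.1, 0.4],
    [0.5, 0.1, 0.0, 0.3],
    [-0.2, 0.4, 0.3, 0.0],
]

top2 = choice_set(0, {0, 1, 2, 3}, phi, k=2)
u_full = mfc_utility(0, {0, 1, 2, 3}, phi, k=2)
u_small = mfc_utility(0, {0, 3}, phi, k=2)
```

Neuron 0 prefers the full coalition over `{0, 3}` because its two best partners (1 and 2) are both present there; adding the weak partner 3 does not lower its utility, which is the top-responsive property.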

3. PAC-Top-Cover: Scalability via Sampling + Stability Guarantees. This is the engine for solving the game. It repeatedly samples coalitions from a distribution $D$ (first drawing a size $s$, then a uniform subset of that size), retains for each player the sampled coalition $T_i^\star$ with the highest MFC utility, and estimates the choice set $B_i$ from that player's top-$k$ partners. A preference digraph is constructed ($i\to j$ if $j\in B_i$), and sink strongly connected components (which are closed under the choice sets) are output as coalitions. The output nodes are then removed, and the process repeats on the remaining neurons. Theoretically, only $m=\mathrm{poly}(n,1/\epsilon,\log(1/\delta))$ samples are needed to obtain an $\epsilon$-PAC stable partition with probability $\ge 1-\delta$, meaning the probability of observing a blocking coalition under distribution $D$ is $\le\epsilon$. This provides theoretical support that the discovered coalitions reflect robust cooperative structures.
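The loop described above (sample, pick each player's best sampled coalition, build the preference digraph, peel off a sink component) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: it draws a fresh batch of `m` samples each round, uses a hypothetical uniform size-then-subset sampler, finds a sink component by picking the node with the smallest reachable set, and carries no PAC guarantee of its own.

```python
import random

def mfc(i, coalition, phi, k):
    """Return (utility, top-k partners) of player i within `coalition`."""
    partners = sorted(j for j in coalition if j != i)
    partners.sort(key=lambda j: phi[i][j], reverse=True)
    chosen = partners[:k]
    return sum(phi[i][j] for j in chosen) / k, chosen

def reachable(start, edges):
    """All nodes reachable from `start` (including itself) in the digraph."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in edges[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def pac_top_cover(n, phi, k, m, seed=0):
    """Simplified top-cover loop over sampled coalitions (illustrative only)."""
    rng = random.Random(seed)
    remaining = list(range(n))
    partition = []
    while remaining:
        # Assumed sampler: uniform size s, then a uniform subset of that size.
        samples = []
        for _ in range(m):
            s = rng.randint(1, len(remaining))
            samples.append(frozenset(rng.sample(remaining, s)))
        # For each player keep the choice set from its best sampled coalition.
        edges = {}
        for i in remaining:
            best_u, best_b = float("-inf"), []
            for t in samples:
                if i in t:
                    u, b = mfc(i, t, phi, k)
                    if u > best_u:
                        best_u, best_b = u, b
            edges[i] = set(best_b) | {i}
        # The node with the smallest reachable set spans a sink SCC:
        # everything it reaches must reach the same set, hence reach it back.
        reach = {v: reachable(v, edges) for v in remaining}
        root = min(remaining, key=lambda v: len(reach[v]))
        coalition = reach[root]
        partition.append(sorted(coalition))
        remaining = [p for p in remaining if p not in coalition]
    return partition

# Hypothetical scores: {0, 1} and {2, 3} are mutually synergistic pairs.
phi = [
    [0, 1, -1, -1],
    [1, 0, -1, -1],
    [-1, -1, 0, 1],
    [-1, -1, 1, 0],
]
coalitions = pac_top_cover(4, phi, k=1, m=100, seed=0)
```

With two clearly separated pairs and `k=1`, each player points at its single synergistic partner, so the digraph decomposes into two 2-cycles that are peeled off one per round.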

4. Cross-layer Tracking: Interaction Mass + Max-weight Matching. Treating coalitions as "meta-neurons," their evolution across depths is analyzed. For coalition pairs $(C,C')$ in adjacent layers, interaction mass is defined as $M(C,C')=\frac{1}{|C||C'|}\sum_{p\in C}\sum_{q\in C'}(W^{(\ell+1)}_{up}[q,p]+W^{(\ell+1)}_{gate}[q,p])\cdot A_p$, covering both additive (up) and gated multiplicative (gate × SiLU) pathways, normalized by coalition size. The bipartite matrix of these masses is solved via maximum weight matching to align coalitions. Transitions are categorized by two ratios: $\alpha$, the share of the source coalition's output that reaches the target, and $\beta$, the share of the target's input that comes from the source. The categories are persistence (both high), splitting (low $\alpha$, high $\beta$), merging (high $\alpha$, low $\beta$), and vanishing (both low). This step is exploratory, as residual connections allow neurons to influence all deeper layers.
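The interaction-mass formula and the alignment step can be sketched on toy data. Several things here are assumptions for illustration: `A[p]` is taken to be a per-neuron activation statistic such as a mean, the weight matrices are indexed `[q][p]` as in the formula, and the matching is found by brute force over permutations, which is only viable for a handful of coalitions and stands in for a proper max-weight matching solver.

```python
from itertools import permutations

def interaction_mass(C, C_next, W_up, W_gate, A):
    """M(C, C') = 1/(|C||C'|) * sum_{p in C, q in C'} (W_up[q][p] + W_gate[q][p]) * A[p]."""
    total = sum((W_up[q][p] + W_gate[q][p]) * A[p] for p in C for q in C_next)
    return total / (len(C) * len(C_next))

def mass_matrix(src, tgt, W_up, W_gate, A):
    """Rows are source coalitions (layer l), columns are target coalitions (layer l+1)."""
    return [[interaction_mass(C, Cn, W_up, W_gate, A) for Cn in tgt] for C in src]

def max_weight_matching(M):
    """Brute-force assignment of each source to a distinct target (small inputs only)."""
    n_src, n_tgt = len(M), len(M[0])
    if n_src > n_tgt:
        raise ValueError("sketch assumes no more sources than targets")
    best, best_score = None, float("-inf")
    for perm in permutations(range(n_tgt), n_src):
        score = sum(M[r][c] for r, c in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return list(enumerate(best)), best_score

# Toy layer pair: source neurons {0,1} feed target neurons {2,3} and vice versa.
W_up = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
W_gate = [[0] * 4 for _ in range(4)]
A = [1.0, 1.0, 1.0, 1.0]  # assumed activation statistic per source neuron

src = [[0, 1], [2, 3]]
tgt = [[0, 1], [2, 3]]
M = mass_matrix(src, tgt, W_up, W_gate, A)
pairs, score = max_weight_matching(M)
```

Here the mass matrix is anti-diagonal, so the matching pairs source coalition 0 with target coalition 1 and source 1 with target 0.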

Key Experimental Results

Models: LLaMA-3.1-8B, Mistral-7B-v0.1, and Pythia-6.9B, all fine-tuned using LoRA (rank 8) on MLP layers 7–14. Tasks: Three scalar objectives from MS MARCO — CQTR (Query Term Coverage), Mean-TF/L (Length-normalized Mean Term Frequency), and RM (Supervised Ranking, NDCG). OOD evaluation on TREC DL-19/20.

Main Results (Extrinsic Evaluation: OOD Drop ↑ and Feature Alignment $R^2$ ↑)

Algorithm	LLaMA OOD Drop	LLaMA Align $R^2$	Mistral OOD Drop	Pythia Align $R^2$
K-means	0.02	0.12	0.03	0.11
Hier. clustering	0.03	0.15	0.03	0.13
Hedonic (OCA)	0.07	0.41	0.09	0.45
Hedonic (PAS)	0.11	0.58	0.13	0.63

(Selected data for CQTR; for RM, Hedonic-PAS reaches an OOD Drop of 0.14–0.17 and an Align $R^2$ of 0.63–0.67.) Jointly ablating a hedonic coalition causes a performance drop 3–5× larger than ablating a group found by clustering. Activation alignment with IR heuristics (BM25/IDF/Coverage) improves from $R^2 \approx 0.11$–$0.18$ to $0.55$–$0.67$.

Ablation Study (Predictive Power: Coalition as Macro-features for Ridge Regression, OOD $R^2$ ↑)

Algorithm	CQTR	Mean-TF/L	RM
Random	0.08	0.09	0.12
K-means	0.16	0.15	0.21
Hier. clustering	0.18	0.17	0.21
Hedonic (OCA)	0.34	0.33	0.38
Hedonic (PAS)	0.43	0.42	0.47

Using coalitions as macro-features, PAS achieves 2–3× the OOD $R^2$ of clustering, with OCA consistently second. This indicates that utility-respecting synergy produces robust transferable features.
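To make the macro-feature setup concrete, here is a minimal sketch under stated assumptions: each coalition is collapsed to the mean activation of its members (the summary does not say how macro-features are pooled), the ridge model has no intercept, and the normal equations are solved with a small Gauss-Jordan routine. All data are invented for illustration.

```python
def macro_features(acts, coalitions):
    """One feature per coalition: mean activation of its members (assumed pooling)."""
    return [[sum(row[p] for p in C) / len(C) for C in coalitions] for row in acts]

def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ridge_fit(X, y, lam):
    """Closed-form ridge weights: (X^T X + lam * I)^{-1} X^T y, no intercept."""
    d = len(X[0])
    G = [[sum(x[a] * x[b] for x in X) + (lam if a == b else 0.0) for b in range(d)]
         for a in range(d)]
    v = [sum(x[a] * t for x, t in zip(X, y)) for a in range(d)]
    return solve(G, v)

def r2(y, pred):
    """Coefficient of determination."""
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, pred))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical activations for 4 samples x 4 neurons and two coalitions.
acts = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 3, 2, 0], [2, 0, 1, 3]]
coalitions = [[0, 1], [2, 3]]
X = macro_features(acts, coalitions)
y = [2 * x0 - x1 for x0, x1 in X]  # target built from the macro-features

w = ridge_fit(X, y, lam=0.0)        # no regularization recovers the generating weights
w_shrunk = ridge_fit(X, y, lam=10.0)
pred = [w[0] * x0 + w[1] * x1 for x0, x1 in X]
fit_r2 = r2(y, pred)
```

In the reported experiment the $R^2$ is measured out of distribution; the sketch only shows the fitting mechanics on the training data, with `lam=10.0` illustrating how regularization shrinks the weights.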

Key Findings

Cross-layer Dynamics (Layers 7→14): Vanishing dominates (typically 60–75% of coalitions disappear in the next layer), splits are common (20–50%), merges are nearly zero, and persistence is generally <12%. This supports the claim that deep MLPs primarily perform feature filtering/refinement rather than creation; once synergistic units are formed, they are mostly pruned or refined rather than fused.
Mean-TF/L shows the most aggressive pruning (vanishing >70%), aligning with the intuition that simple frequency statistics are isolated early and discarded later.
Confidence intervals are consistently narrow across three seeds.
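The transition shares above come from labeling each matched coalition pair by its $(\alpha, \beta)$ values. A minimal sketch of such a rule follows; the single threshold `tau=0.5` separating "high" from "low" is a hypothetical choice, since the summary does not give the cutoffs used in the paper, and the input pairs are invented.

```python
from collections import Counter

CATEGORIES = ("persistence", "splitting", "merging", "vanishing")

def classify_transition(alpha, beta, tau=0.5):
    """Label a coalition transition from its (alpha, beta) ratios.

    tau is an assumed cutoff between "high" and "low".
    """
    hi_a, hi_b = alpha >= tau, beta >= tau
    if hi_a and hi_b:
        return "persistence"
    if not hi_a and hi_b:
        return "splitting"
    if hi_a and not hi_b:
        return "merging"
    return "vanishing"

def transition_rates(pairs, tau=0.5):
    """Fraction of transitions in each category for a list of (alpha, beta) pairs."""
    counts = Counter(classify_transition(a, b, tau) for a, b in pairs)
    return {c: counts[c] / len(pairs) for c in CATEGORIES}

# Hypothetical (alpha, beta) values for four coalitions in one layer.
rates = transition_rates([(0.9, 0.8), (0.1, 0.9), (0.1, 0.2), (0.2, 0.1)])
```

For these four pairs the rule yields one persistence, one split, and two vanishing transitions, so `rates` reports shares of 0.25, 0.25, 0.0 (merging), and 0.5.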

Highlights & Insights

First application of cooperative game theory at the neuron level: Employs hedonic games and PAC stability—tools with theoretical guarantees—to discover and track synergistic neuron groups in fine-tuned LLMs. The perspective is novel and consistent with training dynamics (SGD selection pressure ↔ stable coalitions).
Beyond disentanglement to "higher-order structures": Unlike SAEs which re-express activation space, hedonic coalitions treat neurons as fundamental units rooted in weight geometry and preference structures, capturing non-linear synergies invisible to SAEs or clustering.
Dual Causal + Semantic Validation: Coalitions are functionally indispensable (ablation causes significant OOD degradation) and semantically interpretable (aligned with BM25/IDF), avoiding post-hoc "storytelling."
Actionable Insights from "Meta-neuron" Tracking: Coalition-level persistence/splitting/vanishing points toward interventions at the coalition granularity (editing, merging, transfer) rather than individual weights.

Limitations & Future Work

Local Cross-layer Dynamics: Because interaction mass only looks at adjacent layers, it likely underestimates long-range interactions mediated by residual connections. Cross-layer tracking is explicitly characterized as "exploratory."
High Computational Cost for PAS: Estimating PAS values requires second-order interactions/mixed derivatives. PAC-Top-Cover takes 280 minutes under PAS (vs. 90 minutes for OCA) on 4×A100 GPUs, making full-model scaling expensive.
Narrow Task Scope: Experiments focus on IR scalar output tasks and mid-layers (7–14). Generality to generation, classification, or other depths remains to be verified.
"Rational Agents" as Analogy: Neurons are not actually rational; the framework's explanatory power relies on the assumption of SGD selection pressure. Causal claims should be treated with caution.

Related Work & Insights

Mechanistic Interpretability: Contrasts with probing, SAEs (Huben et al. 2024), and neuron clustering by explicitly modeling "synergy" rather than "correlation/monosemanticity/proximity."
Cooperative Games & PAC Stability: Directly builds on hedonic games (Dreze & Greenberg 1980), Top-Covering algorithms for top-responsive games, and PAC stabilization (Sliwinski & Zick 2017).
LoRA & Mid-layer MLPs: Motivated by observations (Hu et al. 2022, Nijasure et al. 2025) that LoRA primarily updates mid-layer MLPs.
Insight: Treating "synergy/coalition" as a first-class citizen in interpretability suggests a hierarchical understanding of LLMs (Unit → Coalition → Coalition Dynamics).

Rating

Novelty: ⭐⭐⭐⭐⭐ First introduction of hedonic games + PAC stability to neuron-level interpretability.
Experimental Thoroughness: ⭐⭐⭐⭐ Three models, three tasks, dual intrinsic/extrinsic evaluation; however, limited to IR tasks and mid-layers.
Writing Quality: ⭐⭐⭐⭐ Clear chain of logic; persuasive analogies; formulas for PAS and interaction mass are dense but supported by appendices.
Value: ⭐⭐⭐⭐ Provides "synergistic coalitions" as a new actionable analysis unit for understanding LoRA fine-tuning.
