📦 Model Compression

💬 ACL2026 · 59 paper notes

📌 Same area in other venues: 📷 CVPR2026 (98) · 🔬 ICLR2026 (241) · 🧪 ICML2026 (117) · 🤖 AAAI2026 (60) · 🧠 NeurIPS2025 (140) · 📹 ICCV2025 (52)

🔥 Top topics: LLM ×19 · Model Compression ×9 · Compression ×4 · Reasoning ×4 · Alignment/RLHF ×3

A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification

This paper treats the \(token \times layer\) hidden state tensor of a production LLM as a minable resource. By utilizing a two-stage aggregation probe that "compresses tokens first, then layers," it performs safety/sentiment classification within the same forward pass. With only 35M trainable parameters, it approaches the performance of standalone guard models while eliminating an extra LLM call.

A Layer-wise Analysis of Supervised Fine-Tuning

This work conducts a layer-wise analysis of SFT in 1B-32B models through information-theoretic, geometric, and optimization perspectives. It finds that instruction-following capabilities are concentrated in the middle layers (20%-80%) rather than being uniformly distributed. Based on this, a Mid-Block Efficient Tuning strategy is proposed to selectively update middle layers, achieving up to a 10.2% improvement on GSM8K over standard LoRA.
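
To make the "middle layers (20%-80%)" range concrete, here is a minimal sketch of picking the layer indices that a mid-block strategy would update. The function name and the half-open interval are illustrative assumptions; the summary does not say how the paper rounds the boundaries.

```python
from typing import List


def middle_block_layers(num_layers: int, lo: float = 0.2, hi: float = 0.8) -> List[int]:
    """Return indices of layers whose relative depth i / num_layers lies in [lo, hi).

    Hypothetical helper: the boundary handling is an assumption, not the paper's rule.
    """
    return [i for i in range(num_layers) if lo <= i / num_layers < hi]


# For a 32-layer model this selects layers 7 through 25; adapters (e.g. LoRA)
# would then be attached only to those layers.
selected = middle_block_layers(32)
```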

Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference

This paper proposes ASL (Adaptive Selection Layer), which adaptively determines the layer location for KV cache pruning by monitoring the variance of token attention score rankings. It significantly outperforms fixed-layer selection methods on difficult tasks while remaining training-free.

Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines

This paper reinterprets LLM alignment tuning as a dynamic data pipeline design problem: what the model ultimately learns depends not only on optimization algorithms like PPO, DPO, or GRPO, but also on how candidate responses are generated, how preferences are evaluated, and how preference signals are instantiated as training objectives.

Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis

This paper proposes an analytical post-training framework that rapidly restructures dense FFNs into sparse MoEs through neuron activation pattern analysis. By distinguishing high-frequency shared experts from low-frequency routed experts and constructing routers derived from activation statistics, the method achieves a 1.17× speedup with fine-tuning on only 2k samples.

ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs

ArcLight is a lightweight LLM inference framework written from scratch (approximately 10 C++ files) designed for many-core CPUs with multiple NUMA nodes. By utilizing NUMA-local memory pools, multi-view thread pools, cross-NUMA tensor parallelism, and asynchronous subgraph synchronization, it breaks the "remote memory wall." On a 192-core ARM Kunpeng platform, it improves the decode throughput of Qwen3-4B Q4_0 by up to 46% compared to llama.cpp.

BaseCal: Unsupervised Confidence Calibration via Base Model Signals

Observing that base LLMs remain well-calibrated on free-form QA while post-trained LLMs (PoLLMs) are severely overconfident, BaseCal proposes two unsupervised schemes—feeding PoLLM's answers into the base LLM to use token probabilities as confidence (BaseCal-ReEval), or using a linear projection layer to map PoLLM's final hidden states back to the base LLM space and passing them through the base output layer (BaseCal-Proj). This achieves an average 42.9% relative reduction in ECE compared to the best unsupervised baseline across 5 datasets \(\times\) 3 model families.
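
For readers unfamiliar with the metric, the sketch below computes the standard binned Expected Calibration Error and a relative reduction between two ECE values. The equal-width binning and the bin count of 10 are common defaults and are assumptions here; the summary does not state the paper's exact evaluation settings.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[int],
                               num_bins: int = 10) -> float:
    """Binned ECE: weighted average of |mean confidence - accuracy| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(y for _, y in members) / len(members)
            ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


def relative_reduction(baseline_ece: float, new_ece: float) -> float:
    """Fraction by which new_ece is lower than baseline_ece."""
    return (baseline_ece - new_ece) / baseline_ece
```

Under this definition, a 42.9% relative reduction means the new ECE is roughly 0.571 times the baseline ECE.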

Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference

CSD proposes a training-free enhancement framework for speculative decoding. It utilizes Online Correction Memory (OCM) to record high-frequency rejection patterns for rescuing candidates, and employs Semantic Consistency Gating (SCG) to verify candidate reliability based on probability ratios. This approach improves speculative decoding throughput by up to 2.33× while simultaneously increasing accuracy on HumanEval and MATH500.

CBRS: Cognitive Blood Request System with Bilingual Dataset and Dual-Layer Filtering

CBRS proposes a multi-platform framework that efficiently detects and parses blood donation requests from social media streams via a dual-layer filtering architecture (lightweight classifier + LLM). It constructs the first dataset containing 11K Bengali-English-Transliterated Bengali blood donation requests, where a LoRA-fine-tuned Llama-3.2-3B achieves a 92% zero-shot accuracy in parsing tasks.

Cognitive-Uncertainty Guided Knowledge Distillation for Accurate Classification of Student Misconceptions

This paper proposes a two-stage knowledge distillation framework using "Dual-level Marginal Sample Selection" based on teacher cognitive uncertainty and a difficulty-adaptive loss. Using only 10.30% of real samples for incremental training, the 4B student model achieves MAP@3 = 0.9585 (a 17.8% gain) on MAP-Charting. On a benchmark of 220 middle school algebra misconceptions, it reaches 84.38% accuracy, surpassing GPT-5 (67.73%) and the directly fine-tuned 72B teacher (81.25%), while being 23× faster during inference.

DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

The DASH-KV framework reformulates the attention mechanism as an approximate nearest neighbor search problem. By utilizing asymmetric deep hashing, it replaces high-dimensional floating-point similarity calculations with efficient Hamming distance bitwise operations. Combined with a dynamic mixed-precision mechanism, it reduces long-context inference complexity from \(O(N^2)\) to \(O(N)\) while matching full-attention performance.
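
The core primitive, replacing floating-point similarity with Hamming distance over binary codes, can be sketched as follows. This is only an illustration of the bitwise distance and a brute-force top-k lookup; the hash functions, the asymmetric training, and the mixed-precision mechanism are not shown, and `nearest_keys` is a hypothetical helper name.

```python
from typing import List


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two binary codes stored as integers."""
    return bin(a ^ b).count("1")  # XOR marks differing bits, then count them


def nearest_keys(query_code: int, key_codes: List[int], k: int) -> List[int]:
    """Indices of the k keys closest to the query in Hamming distance."""
    ranked = sorted(range(len(key_codes)),
                    key=lambda i: hamming_distance(query_code, key_codes[i]))
    return ranked[:k]
```

Full attention would then be computed only over the selected keys. The sort here is for clarity and is not how a linear-time implementation would select candidates.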

DeepPrune: Parallel Scaling without Inter-Trace Redundancy

This paper proposes DeepPrune, which trains a specialized judge model to predict answer equivalence from partial reasoning traces. By combining this with an online greedy clustering algorithm to dynamically prune redundant parallel CoT paths, it reduces token consumption by 65.73%-88.50% while maintaining competitive accuracy (within 3 percentage points).
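
The online greedy clustering step can be sketched as below, assuming a `judge(a, b)` callable that returns True when two partial traces are predicted to reach the same answer. The callable stands in for the trained judge model and is a placeholder; keeping exactly one trace per cluster is also a simplifying assumption.

```python
from typing import Callable, List


def greedy_prune(traces: List[str], judge: Callable[[str, str], bool]) -> List[str]:
    """Keep one representative per predicted-answer cluster, in arrival order."""
    representatives: List[str] = []
    for trace in traces:
        # A trace judged equivalent to an existing representative is redundant
        # and would be stopped early instead of being decoded to completion.
        if any(judge(rep, trace) for rep in representatives):
            continue
        representatives.append(trace)
    return representatives
```

Only the surviving representatives continue generating, which is where the token savings come from.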

Efficient Learned Data Compression via Dual-Stream Feature Decoupling

This paper proposes the FADE framework, which separates micro-syntax and macro-semantic features into parallel shallow streams using a Dual-stream Multi-scale Decoupler (replacing deep serial stacking). Combined with a Hierarchical Gated Refiner and a Concurrent Stream Parallel Pipeline, it achieves SOTA performance in both compression ratio and throughput.

Enabling Agents to Communicate Entirely in Latent Space

This paper proposes Interlat, a framework that enables LLM agents to communicate entirely in latent space. The sender transmits the final layer's hidden states as a continuous representation of "thought." The receiver interprets these latent messages via a communication adapter and further compresses them to just 8 tokens through latent space reasoning while maintaining competitive performance, achieving a communication speedup of up to 24×.

Establishing a Scale for Kullback–Leibler Divergence in Language Models Across Various Settings

This paper utilizes log-likelihood vectors to embed language models of various architectures into a unified space, systematically measuring characteristic scales of KL divergence across settings including pre-training, model scale, random seeds, quantization, fine-tuning, and layers. It discovers that pre-training trajectories exhibit sub-diffusive behavior in log-likelihood space—model output distributions stabilize early despite continuous drifting in the weight space.

Evolutionary Negative Module Pruning for Better LoRA Merging

The ENMP method is proposed to discover and prune "negative modules" that degrade performance during LoRA merging through an evolutionary search strategy. As a plug-and-play enhancement, it comprehensively improves the performance of existing merging algorithms in both NLP and vision domains.

FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration

This paper proposes FastKV, which decouples context reduction (Token-Selective Propagation during prefill) from KV cache compression (layer-wise KV retention during decoding). It achieves 1.82× prefill and 2.87× decoding speedup on LLaMA-3.1-8B-Instruct, while maintaining accuracy within a 1% drop on LongBench.

Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation

This paper proposes PerSyn (Personalized data Synthesis), which uses a "Route-then-Generate" paradigm in which a router assigns the optimal teacher model to each prompt. By considering both student learnability and teacher response quality, this approach is more efficient and effective than the traditional "Generate-then-Select" paradigm, consistently surpassing all baselines in both instruction tuning and mathematical reasoning scenarios.

From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization

Through systematic mechanistic interpretability analysis, this paper reveals that LLM quantization exhibits two qualitatively different failure modes: 4-bit Signal Degradation (computational patterns remain intact but precision is impaired, allowing for local repair) and 2-bit Computation Collapse (functional destruction of key components, requiring structural reconstruction).

GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts

This paper proposes GlimpRouter: in step-level LRM collaborative inference, the small model first decodes only the first token of each reasoning step, and the entropy of that token, \(\mathbf{H}_{\text{init}}\), is used to estimate step difficulty. If the entropy is low, the small model continues the step; if it is high, the step is handed to the large model. The method is training-free and requires no large-model verifier. On AIME25 it improves accuracy by 10.7% and reduces latency by 25.9% relative to a standalone large model, and it is orthogonally compatible with token-level speculative decoding.
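
The routing rule can be illustrated with a few lines of Python. This is a minimal sketch assuming the first-token distribution is available as a list of probabilities and that a fixed entropy threshold is used; the threshold value and function names are hypothetical, not taken from the paper.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_step(first_token_probs: Sequence[float], threshold: float) -> str:
    """Return which model should generate the rest of the reasoning step."""
    return "large" if token_entropy(first_token_probs) > threshold else "small"


# A peaked distribution stays with the small model; a flat one is escalated.
print(route_step([0.97, 0.01, 0.01, 0.01], threshold=0.5))  # small
print(route_step([0.25, 0.25, 0.25, 0.25], threshold=0.5))  # large
```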

GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models

GRASPrune proposes a structured pruning framework with global budget constraints. By using a Projected Straight-Through Estimator (Projected STE) to enforce hard mask budget constraints during each training step, it jointly prunes FFN channels and KV head groups. It achieves 12.18 PPL on LLaMA-2-7B with 50% parameter retention, requiring only 6 minutes of training on a single A100 GPU.

HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference

HeteroCache is proposed as a training-free dynamic KV cache compression framework. Based on the temporal heterogeneity (stable heads vs. drifting heads) and intra-layer redundancy (clustering of similar heads) of attention heads, it implements a fine-grained role assignment strategy: it allocates larger cache budgets to drifting heads and uses representative heads for sparse monitoring of attention drift to trigger asynchronous on-demand retrieval. It achieves a 3× decoding speedup on a 224K context.

IMPACT: Importance-Aware Activation Space Reconstruction

The IMPACT framework is proposed to shift LLM low-rank compression from minimizing weight reconstruction error to minimizing importance-weighted activation reconstruction error. By incorporating gradient information into the activation covariance matrix, a closed-form optimal solution is derived, achieving up to 55.4% model size reduction while maintaining accuracy.

CadLLM: Improving the Throughput of Diffusion-based LLMs via Training-Free Confidence-Aware Calibration

CadLLM is proposed as a training-free adaptive inference acceleration method that utilizes token decoding confidence signals from diffusion language models (dLLMs) to dynamically adjust four dimensions: block size, step count, vocabulary sampling range, and submission threshold. It achieves 1.1-2.28× throughput gains on LLaDA and DREAM while maintaining competitive accuracy.

IntroLM: Introspective Language Models via Prefilling-Time Self-Evaluation

IntroLM appends special [CPX] introspective tokens to the end of a prompt and utilizes a "token-conditional LoRA" that is active only for these tokens. It calculates the probability of the model answering correctly during the prefilling stage. This self-evaluation does not enter the KV cache and does not affect generation. On long-context QA tasks like HotpotQA, it achieves a ROC-AUC 14 points higher than DeBERTa-v3-Large. When used for model routing, it saves up to 50% of large model calls and reduces end-to-end latency by 33%.

JudgeMeNot: Personalizing Large Language Models to Emulate Judicial Reasoning in Hebrew

This paper proposes a synthetic-organic supervision pipeline to transform raw judicial rulings into reasoning instruction tuning data. It achieves high-fidelity simulation of individual judges' reasoning styles through a "Chain-of-LoRA" strategy (CLM → Instruction Tuning). In Hebrew low-resource scenarios, the generated content is indistinguishable from real judicial writing.

Latent-Condensed Transformer for Efficient Long Context Modeling

LCA proposes performing context compression directly within the latent space of MLA—utilizing query-aware weighted pooling to aggregate semantic latent vectors and anchor selection for positional keys to maintain positional accuracy—achieving 2.5× prefill acceleration and 90% KV cache compression in 128K contexts while maintaining competitive performance.

LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference

The authors theoretically and empirically demonstrate that "layer-wise alignment distillation" and "convergence-based early exit" are systemically incompatible under standard deployment—distilled models utilize every layer efficiently, leaving no redundancy for early exit. They propose LEAP, a zero-additional-parameter auxiliary training objective that forces intermediate layers to approximate the final layer representation early. LEAP achieves a .61× measured wall-clock speedup on MiniLM-L12 (with batch=1, 91.9% of samples exit at L7).

LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?

LightReasoner uses the token distribution difference between a weaker Amateur model and a stronger Expert model to automatically identify high-value reasoning steps. It then performs contrastive self-distillation only on these steps, allowing mathematical reasoning models to match or exceed SFT performance while significantly reducing sampling, training time, and tuning tokens.

LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization

This paper formalizes label-free prompt optimization as a dueling bandit problem and proposes the Prompt Duel Optimizer (PDO). By utilizing Double Thompson Sampling to efficiently select the most informative prompt pairs for comparison and combining it with a top-performer mutation strategy to expand the search space, PDO identifies stronger prompts with fewer judge calls on BBH and MS MARCO.

LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging

LoGo (LoRA on the Go) is proposed as a training-free framework that extracts LoRA activation signals (norm or entropy) via a single forward pass to dynamically select and merge the most relevant LoRA adapters at the instance level, enabling cross-task generalization without labeled data or additional training.

MeepleLM: A Virtual Playtester Simulating Diverse Subjective Experiences

The authors develop a "virtual playtester" for board game designers. By providing official rulebooks and five distinct player personas to a fine-tuned Qwen3-8B (MeepleLM), the system performs three-step reasoning via the Mechanics→Dynamics→Aesthetics (MDA) framework to generate ratings and reviews. Across 207 games, MeepleLM outperforms GPT-5.1 and Gemini3-Pro in community distribution alignment (Wasserstein 0.22 vs. GPT-5.1's 0.95), content diversity (Div 4.34 vs. 4.26), and Opinion Recovery (69.77 vs. 63.44), while achieving over 70% preference in blind A/B testing.

MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation

MTA advances LLM distillation from "aligning specific static layers" to "aligning representation evolution trajectories based on network depth": lower layers align word-level information, while higher layers align phrase-level relationship geometry. As a plug-in, it consistently improves the ROUGE-L performance of FDD, DistiLLM, and DistiLLM-2 on instruction-following tasks.

No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation

This paper proposes NWCAD, a decoding-time adapter that uses a two-stage gating mechanism to fall back to context-free decoding when the context is uninformative (preventing neutral regression) and to leverage the context for correction when it is helpful, balancing "do-no-harm" and "effectiveness."

Not All Directions Matter: Towards Structured and Task-Aware Low-Rank Model Adaptation

This paper proposes StructLoRA, which uses an Information Bottleneck (IB) to filter out task-irrelevant directions in low-rank updates and employs a Graph Neural Network (GNN) during training to coordinate LoRA updates across different layers. It consistently outperforms LoRA, AdaLoRA, DoRA, and Sensitivity-LoRA across language, vision, and multimodal tasks while maintaining zero additional inference overhead.

Polynomial Expansion Rank Adaptation: Enhancing Low-Rank Fine-Tuning with High-Order Interactions

This paper proposes PERA (Polynomial Expansion Rank Adaptation), which expands the linear adaptation space of LoRA into a polynomial manifold by introducing structured polynomial expansion (square and cross terms) within the parameter space of low-rank factors. It significantly enhances weight update expressiveness without increasing rank or inference overhead, consistently outperforming methods such as LoRA, DoRA, and HiRA on commonsense reasoning and NLU tasks.

ProActor: Timing-Aware Reinforcement Learning for Proactive Task Scheduling Agents

ProActor advances conversational task scheduling from "reacting to explicit user instructions" to "proactively triggering actions at appropriate timings." Through automated reference action annotation, proactiveness metrics, turn-level GRPO, and the ART-F efficient training framework, the 4-bit Qwen2.5-14B-ProActor-Q4 achieves a top PRI of 0.7293 on ABCD+, significantly enhancing proactive timing while maintaining action consistency.

Quantize What Counts: More for Keys, Less for Values

This paper theoretically demonstrates from a linear algebra perspective that the spectral and Frobenius norms of Key weights in Transformers are systematically larger than those of Value weights. Based on this, it proposes a Key-priority mixed-precision KV cache quantization strategy (e.g., K4V2), which reduces memory by 25% while maintaining 98.3% of full-precision accuracy.
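
As a quick arithmetic check, the snippet below computes the KV cache saving of a mixed-precision scheme relative to a baseline. Comparing K4V2 against a uniform 4-bit cache (K4V4) is an assumption made here because it reproduces the quoted 25%; the summary does not state the baseline, and per-group scale and zero-point overhead is ignored.

```python
from typing import Tuple


def kv_memory_saving(baseline_bits: Tuple[int, int], mixed_bits: Tuple[int, int]) -> float:
    """Fractional saving of a (key_bits, value_bits) scheme versus a baseline scheme."""
    return 1.0 - sum(mixed_bits) / sum(baseline_bits)


# K4V2 stores 4 + 2 = 6 bits per key/value element pair instead of 4 + 4 = 8.
saving_vs_k4v4 = kv_memory_saving((4, 4), (4, 2))  # 0.25
```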

Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty

The E-GRM framework is proposed to estimate uncertainty using the convergence behavior of model-internal parallel decoding. CoT reasoning is triggered only when necessary, and a discriminative scorer trained with hybrid loss evaluates reasoning path quality. This achieves SOTA performance on multiple reward model benchmarks while reducing inference latency by 62%.

Rethinking Parameter Sharing for LLM Fine-Tuning with Multiple LoRAs

This paper overturns the common assumption that "multiple LoRAs should share the A matrix" by demonstrating that the similarity of A primarily stems from identical initialization rather than shared knowledge. It proposes ALoRA and Fed-ALoRA, which share the B matrix, achieving a balance between performance, fairness, and communication efficiency in multi-task and federated fine-tuning scenarios.

Rethinking Table Pruning in TableQA: From Sequential Revisions to Gold Trajectory-Supervised Parallel Search

This paper proposes TabTrim, transforming table pruning from error-prone single-path sequential revisions into a "SQL trajectory-supervised pruner + loss-aware verifier + parallel trajectory search" framework. It improves average accuracy to 73.5% on WikiTQ, TabFact, and TableBench, significantly outperforming the strongest baseline by 3.2 percentage points.

RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization

RouteNLP is a closed-loop LLM routing and cascading framework that co-optimizes model combinations using task-aware routers, conformal calibrated cascading, and failure-cluster-directed distillation. It achieves a 0.159 cost ratio while maintaining a 0.971 quality ratio across a six-task enterprise benchmark, and reduced inference costs by 58% in an 8-week customer service pilot while maintaining a 91% response acceptance rate.

SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning

SAMoRA addresses the issues of imprecise routing and lack of flexibility in weight fusion in existing MoE-LoRA methods through a semantic-aware router and a task-adaptive scaling mechanism. It achieves SOTA performance on multi-task benchmarks with minimal trainable parameters (0.15%).

Social Story Frames: Contextual Reasoning about Narrative Intent and Reception

This paper proposes SocialStoryFrames, which uses a reader response taxonomy with 10 dimensions and two distilled models to place Reddit stories back into their community and conversational contexts. It infers narrative intent, reader sentiment, and value judgments and, across 6,140 social media stories, supports a more granular analysis of community narrative practices than semantic similarity does.

SRA: Span Representation Alignment for Large Language Model Distillation

SRA replaces the fragile token-level alignment unit in cross-tokenizer LLM distillation with tokenizer-agnostic text spans. By utilizing LCS character offset matching, attention-weighted center-of-mass representations, geometric structure regularization, and shared vocabulary span logit distillation, it consistently outperforms ULD, MinED, DSKD, and MultiLevelOT across multiple teacher-student compression experiments.

SSSD: Simply-Scalable Speculative Decoding

The authors propose SSSD, a training-free speculative decoding method that combines lightweight n-gram matching with hardware-aware speculation length adjustment. Without requiring the training or deployment of any draft models, it achieves up to 2.9× inference speedup and demonstrates superior robustness compared to training-based methods in language/domain migration and long-context scenarios.
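
Draft-model-free speculation via n-gram matching can be sketched as follows: look for the most recent earlier occurrence of the last `n` tokens and propose whatever followed it. This is a generic lookup-style illustration, not SSSD's actual data structure, and `max_draft` is a fixed hypothetical parameter standing in for the hardware-aware length adjustment.

```python
from typing import List


def ngram_draft(tokens: List[int], n: int, max_draft: int) -> List[int]:
    """Propose up to max_draft tokens that followed the latest earlier match of the last n tokens."""
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan right to left, skipping the suffix itself, so the newest match wins.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + max_draft]
    return []
```

The proposed tokens are only candidates: the target model still verifies them in a single forward pass and keeps the longest accepted prefix.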

Stable On-Policy Distillation through Adaptive Target Reformulation

This paper proposes Veto, a target-level reformulation method that stabilizes on-policy knowledge distillation by constructing a teacher-student geometric bridging distribution in logit space. A single parameter \(\beta\) simultaneously acts as an adaptive gradient vetoer in forward KL (suppressing harmful gradients from low-confidence tokens) and a decisiveness knob in reverse KL (balancing reward-driven behavior and output diversity). It achieves a 9.2% improvement over SFT on GSM8K.

TalkLoRA: Communication-Aware Mixture of Low-Rank Adaptation for Large Language Models

TalkLoRA introduces a lightweight Talking Module into the MoE-LoRA architecture, allowing low-rank experts to exchange information before routing. This addresses the issues of unstable routing and expert dominance caused by independent expert operation in traditional MoELoRA. On commonsense reasoning and NLU tasks, it consistently outperforms LoRA and MoELoRA variants with fewer parameters (0.2%).

Task-Stratified Knowledge Scaling Laws for Post-Training Quantized LLMs

This paper establishes the first task-stratified knowledge scaling laws for post-training quantization (PTQ), categorizing LLM capabilities into memory, application, and reasoning layers. It provides a unified model for four factors: model size, bit-width, group size, and calibration set size. Validated across 293 PTQ configurations, the study reveals distinct patterns: reasoning is highly sensitive to precision, application scales with model size, and memory is sensitive to calibration.

TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination

TALE utilizes a retraining-free greedy search process to directly eliminate "impeding" Transformer layers for specific downstream tasks, simultaneously enhancing task accuracy and reducing inference costs across five open-source LLMs and nine benchmarks.

The Pitfalls of KV Cache Compression

This paper identifies that KV cache compression leads to selective forgetting and system prompt leakage in multi-instruction prompts. The issue stems from uneven eviction across different instructions and the erroneous deletion of critical tokens. The authors propose two simple modifications—whitelist retention and fair eviction—to significantly reduce leakage and stabilize instruction following.

TLoRA: Task-aware Low Rank Adaptation of Large Language Models

TLoRA uses training sample activation covariance to initialize and freeze the LoRA \(A\) matrix, then adaptively allocates rank and scaling factors based on module importance. This allows LLMs to match or exceed mainstream LoRA variants on NLU, commonsense reasoning, math, code generation, and chat tasks using approximately half the trainable parameters.

Training-Free Test-Time Contrastive Learning for Large Language Models

This paper proposes TF-TTCL, a training-free test-time contrastive learning framework that enables frozen LLMs to self-improve online through an "explore-reflect-guide" loop. The framework utilizes multi-agent role-playing to generate diverse reasoning trajectories, distills textual rules from contrastive positive and negative samples into a memory bank, and retrieves relevant rules during inference to guide generation.

Two-Stage Regularization-Based Structured Pruning for LLMs

TRSP employs a first-stage regularization to learn the importance of each Transformer layer and a second-stage regularization to minimize the distance between the input and output of candidate layers. This facilitates the transfer of knowledge to retained layers, enabling layer-wise structured pruning and actual inference acceleration for LLMs without requiring retraining.

UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text

UKP_Psycontrol achieved first place in both categories of SemEval-2026 Task 2 by combining LLM prompting, a MaxEnt model with Ising interactions, and neural regression models. The study found that LLMs excel at capturing static emotional signals, whereas short-term emotional changes are explained more by recent numerical trajectories than by textual semantics.

VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection

VecCISC introduces "reasoning trace embedding clustering grouped by answer" prior to the confidence-weighted self-consistency of CISC. By submitting only the representative traces of each semantic cluster to the critic for scoring, it significantly reduces critic calls and token costs while maintaining or slightly improving accuracy.
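
A rough sketch of the cost saving: if traces are already grouped by answer and cluster, the critic is called once per group and the weighted vote proceeds as usual. The cluster ids are taken as given, and reusing the representative's score for every member of its cluster is an assumption of this sketch, not a detail confirmed by the summary.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def clustered_weighted_vote(samples: List[Tuple[str, int, str]],
                            critic: Callable[[str], float]) -> Tuple[str, int]:
    """samples holds (answer, cluster_id, trace); returns (winning answer, critic calls)."""
    rep_score: Dict[Tuple[str, int], float] = {}
    votes: Dict[str, float] = defaultdict(float)
    for answer, cluster_id, trace in samples:
        key = (answer, cluster_id)
        if key not in rep_score:
            rep_score[key] = critic(trace)  # first trace seen acts as the representative
        votes[answer] += rep_score[key]     # confidence-weighted vote
    return max(votes, key=votes.get), len(rep_score)
```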

When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews

This paper advances reviewer disagreement analysis from sentence-pair binary classification to evidence extraction and intensity scoring on full reviews, utilizing the IMPACT multi-agent teacher to distill a TIDE student model deployable via a single forward pass.

Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

This paper unifies local weight fine-tuning, LoRA, and activation steering into "control signal-induced dynamic weight updates." Using preference-utility log-odds and activation manifolds, it explains how strong control enhances target preference at the expense of generation utility. Based on this, the SPLIT training objective is proposed to better balance preference and utility across three types of interventions.

WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling

This paper proposes the Equivalent Model Theory and the WISCA weight scaling strategy. By dynamically adjusting the \(W_q/W_k\) and \(W_v/W_o\) weights of Transformer attention layers during training to equalize their L1 norms (while maintaining model output), the optimization is guided toward flatter local minima. This achieves an average 5.6% zero-shot evaluation improvement and a 2.12% reduction in training perplexity on GQA architectures.
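
The output-preserving rescaling can be illustrated with the simplest case, a single positive scalar \(c\) and no bias or normalization between the projections and the dot product (both are assumptions of this sketch; the summary does not specify the exact transformation WISCA uses). The attention logit between positions \(i\) and \(j\) is unchanged:

\[
(c\,W_q x_i)^\top \left(\tfrac{1}{c}\,W_k x_j\right) = (W_q x_i)^\top (W_k x_j), \qquad c > 0,
\]

and the two norms are equalized by choosing

\[
\lVert c\,W_q \rVert_1 = \lVert \tfrac{1}{c}\,W_k \rVert_1 \iff c = \sqrt{\lVert W_k \rVert_1 / \lVert W_q \rVert_1}.
\]

The same argument applies to the value/output pair, since \((c\,W_o)(\tfrac{1}{c}\,W_v x) = W_o W_v x\).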