Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations

Conference: NeurIPS 2025 arXiv: 2508.03550 Code: https://github.com/sustech-nlp/LAGER Area: LLM Evaluation Keywords: LLM-as-a-Judge, hidden representations, cross-layer aggregation, evaluation alignment, plug-and-play

TL;DR

This paper proposes LAGER, a framework that aggregates score token logits from intermediate to final layers of an LLM and computes an expected score to derive the final judgment. Without any model fine-tuning, LAGER improves human alignment by up to 7.5% and matches or surpasses reasoning-based methods without requiring chain-of-thought inference.

Background & Motivation

Background: LLM-as-a-Judge has become the dominant paradigm for automatic evaluation, yet improving its alignment with human judgments remains a central challenge. Existing approaches either rely on complex chain-of-thought reasoning (increasing computational cost) or require fine-tuning (sacrificing generalizability).

Limitations of Prior Work: The standard practice uses only the highest-probability score token from the final layer (vanilla score), which (a) discards rich information in the full probability distribution (e.g., when scores of 4 and 5 have similar probabilities but only 5 is selected), and (b) ignores potentially superior judgment signals encoded in intermediate layers.

Core Observation: Empirical analysis reveals that hidden representations from middle-to-upper layers frequently yield judgment scores more aligned with human ratings than those from the final layer — different layers encode complementary semantic and task-level information.

Key Insight: A better scoring distribution can be obtained by computing a weighted aggregation of score logits across all layers, followed by taking the expectation to produce a continuous, fine-grained score. Layer weights are trained lightly on a small validation set (only \(L+1\) parameters total), while the model backbone remains fully frozen.

Core Idea: Cross-layer logit weighted aggregation + probability distribution expectation = superior judgment scores compared to final-layer argmax, with a fully plug-and-play design.
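The expectation idea can be illustrated with a toy computation (the 0.45/0.55 split is the paper's own example; the surrounding zeros are filled in for illustration):

```python
# Toy illustration: a judge that puts probability 0.45 on score 4
# and 0.55 on score 5 (remaining mass set to zero for simplicity).
probs = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.45, 5: 0.55}

argmax_score = max(probs, key=probs.get)               # discrete: 5
expected_score = sum(s * p for s, p in probs.items())  # continuous: 4.55
```

The continuous score preserves the near-tie between 4 and 5 that argmax discards.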

Method

Overall Architecture

During LLM-based evaluation, at the position of the score token, hidden representations \(\mathbf{h}_n^{(l)}\) are extracted from all \(L+1\) layers (from the embedding layer to the final decoder layer). These representations are projected to logits via the shared unembedding matrix, aggregated with per-layer weights, and then passed through a softmax over the candidate score tokens to obtain a probability distribution, whose expectation serves as the final score.
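The pipeline above can be sketched with synthetic tensors. The layer count matches LLaMA-3.1-8B's 33 hidden-state layers, but the hidden size, vocabulary size, and score-token ids below are made-up placeholders, not the model's real values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes only: L+1 = 33 layers, hidden size d = 64,
# vocab size V = 1000; the ids of the "1".."5" tokens are hypothetical.
L_plus_1, d, V = 33, 64, 1000
score_token_ids = np.array([16, 17, 18, 19, 20])
scores = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Hidden states at the score-token position, one per layer
# (in practice extracted via model(..., output_hidden_states=True)).
H = rng.standard_normal((L_plus_1, d))
W_unembed = rng.standard_normal((d, V)) / np.sqrt(d)  # shared unembedding
w = np.full(L_plus_1, 1.0 / L_plus_1)                 # uniform = "w.o. tuning"

# 1) Per-layer logits, restricted to the candidate score tokens.
layer_logits = (H @ W_unembed)[:, score_token_ids]    # shape (L+1, 5)
# 2) Weighted aggregation BEFORE softmax (preserves logit scale).
z_hat = w @ layer_logits                              # shape (5,)
# 3) Softmax over the candidates; the expectation is the final score.
p = np.exp(z_hat - z_hat.max())
p /= p.sum()
s_star = float(scores @ p)
print(round(s_star, 3))
```

With trained weights, `w` would simply be replaced by the learned per-layer values; nothing else in the pipeline changes.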

Key Designs

  1. Cross-Layer Logit Aggregation
     • Formula: \(\hat{\mathbf{z}} = \sum_{i=0}^{L} w_i [\mathbf{h}_n^{(i)} \mathbf{W}_{\text{unembd}}]_{\mathcal{M}}\), where \(\mathcal{M}\) is the set of candidate score tokens.
     • Design Motivation: Different layers encode information at different granularities: lower layers capture local lexical features, middle layers capture semantics, and upper layers capture task-level reasoning. Aggregation integrates the perspectives of all layers.
     • Key Detail: Softmax normalization is not applied before aggregation (ablations confirm that pre-normalization discards the relative scale information in the logits, degrading performance).

  2. Expected Score
     • Formula: \(s^* = \sum_{s \in \mathbb{S}} s \times P(s)\), where \(P(s) = \text{softmax}(\hat{\mathbf{z}})[s]\).
     • Design Motivation: Provides finer granularity than argmax: if \(P(4)=0.45\) and \(P(5)=0.55\), argmax assigns 5, while the expected score yields 4.55, better discriminating response quality.
     • This modification alone yields significant improvements (the E-Score baseline).

  3. Lightweight Weight Training
     • The \(L+1\) layer weights are trained on a small validation set with a joint CE+MAE loss (e.g., only 33 parameters for LLaMA-3.1-8B).
     • The model backbone is completely frozen; next-token prediction is unaffected.
     • Weights are trained once and reused across all benchmarks and downstream tasks.
     • Uniform aggregation without weight training (LAGER w.o. tuning) also yields substantial improvements.
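The weight-training step can be sketched as follows. The data is synthetic and the finite-difference optimizer is a stand-in for whatever the paper actually uses; the point is only that a joint CE+MAE objective over \(L+1\) scalar weights is trivially cheap to optimize:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (not the paper's): per-example, per-layer logits over the
# five candidate score tokens, plus human labels as indices 0..4.
N, L_plus_1 = 200, 33
scores = np.arange(1, 6, dtype=float)
layer_logits = rng.standard_normal((N, L_plus_1, 5))
y = rng.integers(0, 5, size=N)

def loss(w):
    z = np.einsum("l,nls->ns", w, layer_logits)      # aggregate logits
    z = z - z.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    ce = -np.log(p[np.arange(N), y] + 1e-12).mean()  # cross-entropy
    mae = np.abs(p @ scores - scores[y]).mean()      # expected-score MAE
    return ce + mae

# Only L+1 = 33 parameters, so even naive finite-difference gradient
# descent is affordable; the LLM backbone itself is never touched.
w = np.full(L_plus_1, 1.0 / L_plus_1)                # init: uniform
w0 = w.copy()
E, lr, eps = np.eye(L_plus_1), 0.05, 1e-5
for _ in range(50):
    g = np.array([(loss(w + eps * E[i]) - loss(w)) / eps
                  for i in range(L_plus_1)])
    w = w - lr * g

assert loss(w) < loss(w0)   # training reduced the joint CE+MAE loss
```

In practice one would use autograd rather than finite differences, but with 33 parameters either works.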

Key Experimental Results

Main Results (Spearman Correlation, Direct Evaluation without Reasoning Chain)

| Model | Method | Flask | HelpSteer | BIGGen | Avg. |
|---|---|---|---|---|---|
| LLaMA-3.1-8B | VScore | 0.442 | 0.452 | 0.333 | 0.409 |
| LLaMA-3.1-8B | E-Score | 0.454 | 0.520 | 0.403 | 0.459 |
| LLaMA-3.1-8B | LAGER | 0.488 | 0.560 | 0.421 | 0.490 |
| Qwen-2.5-14B | VScore | 0.489 | 0.440 | 0.420 | 0.450 |
| Qwen-2.5-14B | LAGER | 0.528 | 0.524 | 0.449 | 0.500 |
| LLaMA-3.3-70B | VScore | 0.501 | 0.508 | 0.445 | 0.485 |
| LLaMA-3.3-70B | LAGER | 0.538 | 0.548 | 0.473 | 0.520 |

Comparison with Reasoning-Based Methods

| Method | Flask | HelpSteer | BIGGen |
|---|---|---|---|
| VScore + Reasoning | 0.456 | 0.470 | 0.388 |
| LAGER (no reasoning) | 0.488 | 0.560 | 0.421 |

Key Findings

  • LAGER achieves up to 7.5% average Spearman correlation improvement across three benchmarks.
  • LAGER matches or surpasses explicit reasoning methods without any chain-of-thought — shallow reasoning chains prove unreliable.
  • Uniform aggregation (no training) already yields significant gains; trained weights provide further improvement.
  • Aggregating before softmax outperforms aggregating after softmax: preserving logit scale information is critical.
  • Downstream validation: using LAGER to select instruction fine-tuning data outperforms multiple baselines on AlpacaEval-2.0.
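The before/after-softmax finding can be illustrated with two made-up layers, one confident and one nearly flat. Averaging logits first lets the confident layer's scale carry through; averaging probabilities flattens it:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Made-up logits over score tokens "1".."5":
# layer_a is confident (large-scale logits), layer_b is nearly flat.
layer_a = np.array([0.0, 0.0, 0.0, 2.0, 6.0])
layer_b = np.array([0.1, 0.0, 0.2, 0.3, 0.1])

# Aggregating BEFORE softmax keeps the confident layer's scale.
p_pre = softmax(0.5 * layer_a + 0.5 * layer_b)
# Aggregating AFTER softmax throws that scale information away.
p_post = 0.5 * softmax(layer_a) + 0.5 * softmax(layer_b)

print(p_pre.max(), p_post.max())  # p_pre is the sharper distribution
```

Both aggregates pick the same top score here, but the pre-softmax distribution is visibly sharper, which matters once the expectation is taken.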

Highlights & Insights

  • The finding that "intermediate layers understand judgment better than the final layer" is highly instructive: the final layer may lose certain judgment-relevant semantic signals due to over-optimization toward the next-token prediction objective.
  • Minimalist design: only 33 trainable parameters for an 8B model, fully plug-and-play without modifying the inference pipeline — likely the most lightweight approach to improving LLM-as-a-Judge currently available.
  • The improvement from expected score vs. argmax, while seemingly minor, is conceptually significant: it converts discrete scores into continuous ones, capturing the model's "uncertainty" between adjacent ratings.

Limitations & Future Work

  • Access to intermediate hidden representations is required, so the method cannot be applied to API-only models (though falling back to E-Score still provides some benefit).
  • Layer weights are trained on a validation set — if the validation distribution diverges significantly from the test distribution, performance may be suboptimal.
  • Only pointwise evaluation is addressed; pairwise comparison settings remain unexplored.
  • Weights are not transferable across models with different layer counts — separate weight training is required per model.
  • No mechanistic analysis is provided to explain why intermediate layers yield better judgment — deeper investigation into what judgment-relevant signals these layers encode is lacking.
Comparison with Related Methods

  • vs. G-Eval: G-Eval computes the softmax expectation over final-layer logits (equivalent to E-Score); LAGER further aggregates across layers for superior performance.
  • vs. Prometheus/TIGERScore: These methods fine-tune the entire LLM, limiting generalizability. LAGER freezes the model and trains only 33 parameters.
  • vs. CoT reasoning methods: Reasoning-based approaches increase latency and yield unstable reasoning quality; LAGER requires no reasoning and is both faster and more accurate.

Rating

  • Novelty: ⭐⭐⭐⭐ Cross-layer aggregation for judgment is a novel perspective, though the core technique (weighted logit aggregation) is relatively straightforward.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Six models, three benchmarks, multiple ablations, and downstream application validation — comprehensive coverage.
  • Writing Quality: ⭐⭐⭐⭐⭐ Motivation is clearly articulated (the layer-wise analysis in Figure 2 is particularly convincing); methodological description is precise.
  • Value: ⭐⭐⭐⭐⭐ A plug-and-play, near-zero-cost improvement with direct practical value for any scenario employing LLM-as-a-Judge.