
ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences

Conference: AAAI 2026 arXiv: 2505.17691 Code: https://github.com/yy0525/ELSPR Area: LLM Pre-training Keywords: LLM Evaluation, Non-Transitive Preferences, Tournament Graph, Data Cleaning, Structural Entropy

TL;DR

ELSPR models the pairwise preferences of LLM evaluators as tournament graphs, identifies non-transitive preferences via strongly connected components (SCCs), proposes a normalized directed-graph structural entropy metric, and filters problematic training data through graph reconstruction. The result: a 13.8-percentage-point reduction in the non-transitivity rate and a 0.088 decrease in structural entropy, while the discarded data achieves only 34.4% human agreement (vs. 52.6% for retained data).

Background & Motivation

Background: LLM-as-judge has been widely adopted to evaluate other models. Pairwise comparison is a common paradigm, but it may produce non-transitive preferences (A>B, B>C, C>A).

Limitations of Prior Work: (1) Non-transitive preferences undermine ranking reliability; (2) existing head-to-head evaluations neglect transitivity constraints; (3) the root cause of non-transitivity — low-quality or ambiguous training data — has not been systematically addressed.

Key Challenge: Ambiguous comparison pairs in training data cause evaluators to learn contradictory preference patterns.

Goal: Automatically identify and remove training data that induces non-transitive preferences.

Key Insight: Modeling preferences as a graph-theoretic problem — tournament graphs combined with SCC analysis.

Core Idea: Non-transitive relations are identified via SCCs in tournament graphs, which are then reconstructed into DAGs to filter inconsistent training samples.

Method

Overall Architecture

(1) Construct a tournament graph per question (with position-swapping to debias); (2) apply Tarjan's algorithm to find SCCs; (3) expand SCCs into a DAG by sorting nodes by in-degree → filter inconsistent training samples into "Cleaned" vs. "Discarded" sets; (4) fine-tune the evaluator on cleaned data via LoRA.
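Steps (1)–(3) can be sketched in Python. This is a minimal illustration, assuming a tournament graph in which an edge u→v means "u is preferred over v"; the model names and preferences below are invented for the example, not taken from the paper.

```python
from collections import defaultdict

def tarjan_scc(nodes, edges):
    """Find strongly connected components with Tarjan's algorithm."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []
    counter = [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in adj[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v roots an SCC: pop it off the stack
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in nodes:
        if v not in index:
            strongconnect(v)
    return sccs

# One question's tournament: A>B, B>C, C>A form a cycle; all three beat D.
models = ["A", "B", "C", "D"]
prefs = [("A", "B"), ("B", "C"), ("C", "A"),
         ("A", "D"), ("B", "D"), ("C", "D")]
sccs = tarjan_scc(models, prefs)
# The paper additionally excludes pairs joined by bidirectional (tie) edges;
# this sketch applies only the |S| > 2 size test.
non_transitive = [s for s in sccs if len(s) > 2]
print(non_transitive)  # the {A, B, C} cycle is flagged
```

Preferences inside a flagged SCC are the candidates for filtering in step (3).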

Key Designs

  1. Non-Transitivity Detection: An SCC with \(|S|>2\) containing node pairs without bidirectional edges is classified as a non-transitive SCC.
  2. Normalized Structural Entropy: \(\tau(G) = H^2(G) / \log_2 n\), measuring hierarchical uncertainty of the graph.
  3. SCC→DAG Reconstruction: Nodes are sorted by in-degree (as a capability estimate); contradictory intra-SCC edges are removed.
  4. LoRA Fine-tuning: rank=8, 3 epochs, lr=1e-4, batch=16; teacher model is Qwen2.5-Max.
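The in-degree heuristic from design 3 can be sketched as follows. In a tournament where u→v means "u beats v", a node's in-degree counts its losses, so sorting by in-degree ascending orders nodes from strongest to weakest; edges pointing against that order are dropped. The example SCC below is illustrative, and the stable tie-breaking is an assumption (the paper does not specify one, which is exactly the heuristic's weak spot noted in the limitations).

```python
def reconstruct_dag(scc, edges):
    """Order an SCC's nodes by in-degree (losses) and drop edges that
    contradict the resulting ranking, turning the SCC into a DAG."""
    indeg = {v: 0 for v in scc}
    for u, v in edges:
        if u in indeg and v in indeg:
            indeg[v] += 1
    # Fewest losses first; ties keep input order (an assumed tie-break).
    order = sorted(scc, key=lambda v: indeg[v])
    rank = {v: i for i, v in enumerate(order)}
    kept = [(u, v) for u, v in edges
            if not (u in rank and v in rank and rank[u] > rank[v])]
    return order, kept

# A 4-node non-transitive SCC: A>B>C>D plus the contradictory edge D>A.
scc = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D"),
         ("D", "A"), ("A", "C"), ("B", "D")]
order, kept = reconstruct_dag(scc, edges)
print(order)  # ['A', 'B', 'C', 'D']; the edge D->A is removed
```

Training samples corresponding to the removed edges land in the "Discarded" set.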

Loss & Training

Standard preference learning loss. Qwen2.5-7B-Instruct and LLaMA3.1-8B-Instruct serve as backbone models.
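The reported fine-tuning setup can be summarized as a config sketch. Values marked "paper" come from the summary above; everything else (target modules, alpha, dropout) is an illustrative assumption, not taken from the paper.

```python
# LoRA hyperparameters; "paper" marks values reported in the summary above.
lora_config = dict(
    r=8,                                   # paper: LoRA rank
    lora_alpha=16,                         # assumed (commonly 2 * r)
    target_modules=["q_proj", "v_proj"],   # assumed
    lora_dropout=0.05,                     # assumed
)
train_config = dict(
    num_epochs=3,                          # paper
    learning_rate=1e-4,                    # paper
    batch_size=16,                         # paper
    teacher_model="Qwen2.5-Max",           # paper
    base_models=["Qwen2.5-7B-Instruct", "LLaMA3.1-8B-Instruct"],  # paper
)
print(lora_config["r"], train_config["learning_rate"])
```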

Key Experimental Results

Main Results (Cross-validated Non-Transitivity Rate, averaged over 5 test sets)

| Data Type | \(\rho_{\text{non-trans}}\) | \(\tau_{\text{avg}}\) |
|---|---|---|
| Raw | 64.3% | 0.811 |
| Cleaned | 50.5% | 0.723 |
| Δ | −13.8 pts | −0.088 |
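The entropy trend behind the \(\tau\) numbers can be illustrated with a simplified stand-in. This is not the paper's \(H^2(G)\); it is the Shannon entropy of the tournament's in-degree distribution normalized by \(\log_2 n\), which shows the same qualitative behavior: a preference cycle spreads losses evenly and scores near 1, while a clean total order scores lower.

```python
import math

def normalized_indegree_entropy(n, edges):
    """Entropy of the in-degree (loss) distribution, normalized by log2(n).
    A simplified 1-D stand-in for the paper's structural entropy."""
    indeg = [0] * n
    for _, v in edges:
        indeg[v] += 1
    total = sum(indeg)  # equals the number of edges
    h = 0.0
    for d in indeg:
        if d > 0:
            p = d / total
            h -= p * math.log2(p)
    return h / math.log2(n)

# 3-cycle (A>B, B>C, C>A): maximally non-transitive
tau_cycle = normalized_indegree_entropy(3, [(0, 1), (1, 2), (2, 0)])
# Total order (A>B, A>C, B>C): fully transitive
tau_chain = normalized_indegree_entropy(3, [(0, 1), (0, 2), (1, 2)])
print(round(tau_cycle, 3), round(tau_chain, 3))  # 1.0 0.579
```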

Human Validation

| Metric | Cleaned | Discarded |
|---|---|---|
| Human Agreement | 52.6% | 34.4% |
| Model–Human Agreement | 80.6% | 51.2% |

Ablation Study

  • Random filtering (removing an equivalent amount of data) leads to higher non-transitivity, confirming the necessity of targeted filtering.
  • Cross-model validation: equally effective on LLaMA3.1 (Helpful_Base non-transitivity rate: 40.2% after cleaning vs. 59.0% raw).
  • ~80% of training data is retained — maximum quality gain at minimal data loss.

Key Findings

  • Discarded data achieves only 34.4% human agreement, confirming these are genuinely low-quality samples.
  • Non-transitive preferences occur more frequently among response pairs with quality differences below the just-noticeable difference (higher Self-BLEU).
  • The cleaned model exhibits stronger discriminability on MT-bench (SD improvement of 2–4%).

Highlights & Insights

  • Graph-theoretic treatment of preference data is an elegant design choice — SCCs precisely capture "mutually contradictory evaluation cycles."
  • Validation of discarded data quality is highly convincing — 34.4% human agreement demonstrates that useful data is not being discarded, and model–human agreement drops from 80.6% to 51.2% (near random).
  • Structural entropy provides a novel quantitative tool for preference evaluation, generalizable to any scenario requiring assessment of preference consistency.
  • From a graph-theoretic perspective, non-transitive preferences are essentially "evaluation cycles" — SCCs are the canonical tool for capturing cycles, making the methodological choice highly natural.

Limitations & Future Work

  • Repeated inference at multiple temperatures is required to construct the graph (computational overhead), which may be impractical for large-scale preference datasets.
  • The quality of cleaned data is upper-bounded by the teacher model — inconsistent preferences in the teacher model will degrade cleaning effectiveness.
  • Only binary preferences are handled; the approach is not extended to rating-based or multi-candidate ranking scenarios, despite the growing prevalence of K-wise comparisons in LLM evaluation.
  • Using in-degree as a capability estimate for SCC→DAG reconstruction is a heuristic that may be inaccurate for highly imbalanced graphs.
  • The assumption that non-transitivity primarily stems from data quality rather than the inherent complexity of evaluation tasks may not hold for subjective tasks where a total ordering does not exist.
  • Filtering ~20% of training data reduces dataset size, which may cause excessive data loss in low-resource settings.
Connections & Comparisons

  • Directly relevant to RLHF data cleaning — non-transitivity in preference data may be a source of reward hacking, and cleaning preference data could fundamentally improve reward model quality.
  • vs. Xu et al. (2025): they identify non-transitivity in GPT-4 evaluations but propose no solution; ELSPR provides a systematic graph-theoretic remedy.
  • vs. traditional data denoising methods: general-purpose methods lack structural priors specific to preference data; ELSPR leverages the mathematical properties of tournament graphs for targeted filtering.
  • vs. Canoe (AAAI 2026): Canoe addresses faithfulness (models failing to follow context), while ELSPR addresses consistency of the evaluator itself — the two are complementary, with the former improving evaluated models and the latter improving evaluators.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The combination of graph theory, structural entropy, and preference data cleaning is highly original; SCC analysis precisely captures non-transitive preferences.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Cross-validation across 5 test sets, human validation, and cross-model validation; rigorously designed.
  • Writing Quality: ⭐⭐⭐⭐ Formally rigorous; the derivation from tournament graphs to DAGs is clearly presented.
  • Value: ⭐⭐⭐⭐⭐ Directly applicable to improving the reliability of LLM evaluation systems and the quality of RLHF data.