ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences¶
Conference: AAAI 2026 arXiv: 2505.17691 Code: https://github.com/yy0525/ELSPR Area: LLM Pre-training Keywords: LLM Evaluation, Non-Transitive Preferences, Tournament Graph, Data Cleaning, Structural Entropy
TL;DR¶
ELSPR models the pairwise preferences of LLM evaluators as tournament graphs, identifies non-transitive preferences via strongly connected components (SCCs), proposes a normalized directed-graph structural entropy metric, and filters problematic training data through graph reconstruction. The result: a 13.8-percentage-point reduction in the non-transitivity rate and a 0.088 decrease in structural entropy, while the discarded data achieves only 34.4% human agreement (vs. 52.6% for retained data).
Background & Motivation¶
Background: LLM-as-judge has been widely adopted to evaluate other models. Pairwise comparison is a common paradigm, but it may produce non-transitive preferences (A>B, B>C, C>A).
Limitations of Prior Work: (1) Non-transitive preferences undermine ranking reliability; (2) existing head-to-head evaluations neglect transitivity constraints; (3) the root cause of non-transitivity — low-quality or ambiguous training data — has not been systematically addressed.
Key Challenge: Ambiguous comparison pairs in training data cause evaluators to learn contradictory preference patterns.
Goal: Automatically identify and remove training data that induces non-transitive preferences.
Key Insight: Modeling preferences as a graph-theoretic problem — tournament graphs combined with SCC analysis.
Core Idea: Non-transitive relations are identified via SCCs in tournament graphs, which are then reconstructed into DAGs to filter inconsistent training samples.
Method¶
Overall Architecture¶
(1) Construct a tournament graph per question (judging each pair twice with positions swapped to mitigate position bias); (2) apply Tarjan's algorithm to find SCCs; (3) expand SCCs into a DAG by sorting nodes by in-degree, then split the training samples into "Cleaned" vs. "Discarded" sets according to edge consistency; (4) fine-tune the evaluator on the cleaned data via LoRA.
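Step (1) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `judge(a, b)` is a hypothetical evaluator call returning the index (0 or 1) of the preferred response, and the rule of keeping both directed edges when the swapped judgments disagree is one plausible reading of the position-swap debiasing.

```python
from itertools import combinations

def build_tournament(responses, judge):
    """Build a tournament graph over candidate responses to one question.

    `judge(a, b)` is a hypothetical evaluator call (assumption, not from the
    paper) returning 0 if the first response is preferred, else 1. Each pair
    is judged twice with positions swapped; an edge u -> v ("u beats v") is
    added only when both orderings agree, otherwise both directed edges are
    kept, recording the inconsistency in the graph.
    """
    edges = set()
    for u, v in combinations(range(len(responses)), 2):
        first = judge(responses[u], responses[v])   # u shown in position 1
        second = judge(responses[v], responses[u])  # v shown in position 1
        u_wins_first = (first == 0)
        u_wins_second = (second == 1)
        if u_wins_first and u_wins_second:
            edges.add((u, v))
        elif not u_wins_first and not u_wins_second:
            edges.add((v, u))
        else:
            # inconsistent under position swap: keep both directions
            edges.add((u, v))
            edges.add((v, u))
    return edges
```

With a position-consistent judge this yields a plain tournament; a judge that always prefers position 1 produces bidirectional edges for every pair, making the position bias visible in the graph itself.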
Key Designs¶
- Non-Transitivity Detection: An SCC \(S\) with \(|S|>2\) that contains at least one node pair without bidirectional edges (i.e., not a pure tie cluster) is classified as a non-transitive SCC.
- Normalized Structural Entropy: \(\tau(G) = H^2(G) / \log_2 n\), measuring hierarchical uncertainty of the graph.
- SCC→DAG Reconstruction: Nodes are sorted by in-degree (as a capability estimate); contradictory intra-SCC edges are removed.
- LoRA Fine-tuning: rank=8, 3 epochs, lr=1e-4, batch=16; teacher model is Qwen2.5-Max.
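The SCC detection and SCC→DAG reconstruction steps can be sketched as below. This is a minimal sketch under the in-degree heuristic described above (fewer incoming "loss" edges taken as a proxy for stronger capability); function and variable names are illustrative, not from the paper's code.

```python
from collections import defaultdict

def tarjan_scc(nodes, edges):
    """Tarjan's algorithm: return the strongly connected components of a
    directed graph, each as a list of nodes."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    index, low, on_stack = {}, {}, set()
    stack, sccs, counter = [], [], [0]

    def connect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in adj[v]:
            if w not in index:
                connect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in nodes:
        if v not in index:
            connect(v)
    return sccs

def scc_to_dag(nodes, edges):
    """Break cycles: rank the nodes inside each SCC by in-degree (fewer
    losses = stronger), keep edges consistent with that ranking, and mark
    contradictory intra-SCC edges as discarded (their underlying training
    pairs get filtered out)."""
    indeg = {v: 0 for v in nodes}
    for _, v in edges:
        indeg[v] += 1
    comp_of, rank = {}, {}
    for cid, comp in enumerate(tarjan_scc(nodes, edges)):
        for r, node in enumerate(sorted(comp, key=lambda x: indeg[x])):
            comp_of[node] = cid
            rank[node] = r
    kept, discarded = set(), set()
    for u, v in edges:
        if comp_of[u] != comp_of[v] or rank[u] < rank[v]:
            kept.add((u, v))        # cross-SCC edges are always acyclic
        else:
            discarded.add((u, v))   # contradicts the in-degree ordering
    return kept, discarded
```

For a 3-cycle A>B, B>C, C>A, exactly the edge that contradicts the in-degree ordering is discarded, and the remaining graph is a DAG.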
Loss & Training¶
Standard preference learning loss. Qwen2.5-7B-Instruct and LLaMA3.1-8B-Instruct serve as backbone models.
Key Experimental Results¶
Main Results (Cross-validated Non-Transitivity Rate, averaged over 5 test sets)¶
| Data Type | \(\rho_{\text{non-trans}}\)↓ | \(\tau_{\text{avg}}\)↓ |
|---|---|---|
| Raw | 64.3% | 0.811 |
| Cleaned | 50.5% | 0.723 |
| Δ | -13.8% | -0.088 |
Human Validation¶
| Metric | Cleaned | Discarded |
|---|---|---|
| Human Agreement | 52.6% | 34.4% |
| Model–Human Agreement | 80.6% | 51.2% |
Ablation Study¶
- Random filtering (removing an equivalent amount of data) leads to higher non-transitivity, confirming the necessity of targeted filtering.
- Cross-model validation: equally effective on LLaMA3.1 (on the Helpful_Base subset, a non-transitivity rate of 40.2% with cleaned data vs. 59.0% with raw data).
- ~80% of training data is retained — maximum quality gain at minimal data loss.
Key Findings¶
- Discarded data achieves only 34.4% human agreement, confirming these are genuinely low-quality samples.
- Non-transitive preferences occur more frequently among response pairs whose quality gap falls below the just-noticeable difference, i.e., highly similar responses (higher Self-BLEU).
- The cleaned model exhibits stronger discriminability on MT-bench (SD improvement of 2–4%).
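The Self-BLEU signal from the findings above can be illustrated with a minimal pure-Python sketch. This is a simplified BLEU (unigram/bigram modified precision with a brevity penalty), not the exact metric used in the paper; higher Self-BLEU means the responses in a set are more similar to one another.

```python
import math
from collections import Counter

def _ngram_counts(tokens, n):
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of modified n-gram precisions
    (n = 1..max_n) times a brevity penalty. Whitespace tokenization."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c, r = _ngram_counts(cand, n), _ngram_counts(ref, n)
        overlap = sum(min(cnt, r[g]) for g, cnt in c.items())
        total = max(1, sum(c.values()))
        # floor at a tiny value so a zero-overlap order doesn't blow up log()
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(1, len(cand)))
    return bp * math.exp(sum(log_precisions) / max_n)

def self_bleu(responses):
    """Average BLEU of each response against all the others."""
    scores = []
    for i, resp in enumerate(responses):
        others = [bleu(resp, o) for j, o in enumerate(responses) if j != i]
        scores.append(sum(others) / len(others))
    return sum(scores) / len(scores)
```

Under this sketch, near-duplicate responses score close to 1 while unrelated ones score near 0, matching the intuition that high-Self-BLEU pairs are the hardest for an evaluator to order consistently.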
Highlights & Insights¶
- Graph-theoretic treatment of preference data is an elegant design choice: non-transitive preferences are precisely "evaluation cycles," and SCCs are the canonical tool for capturing cycles, so the methodological fit is natural.
- Validation of discarded data quality is highly convincing — 34.4% human agreement demonstrates that useful data is not being discarded, and model–human agreement drops from 80.6% to 51.2% (near random).
- Structural entropy provides a novel quantitative tool for preference evaluation, generalizable to any scenario requiring assessment of preference consistency.
Limitations & Future Work¶
- Repeated inference at multiple temperatures is required to construct the graph (computational overhead), which may be impractical for large-scale preference datasets.
- The quality of cleaned data is upper-bounded by the teacher model — inconsistent preferences in the teacher model will degrade cleaning effectiveness.
- Only binary preferences are handled; the approach is not extended to rating-based or multi-candidate ranking scenarios, despite the growing prevalence of K-wise comparisons in LLM evaluation.
- Using in-degree as a capability estimate for SCC→DAG reconstruction is a heuristic that may be inaccurate for highly imbalanced graphs.
- The assumption that non-transitivity primarily stems from data quality rather than the inherent complexity of evaluation tasks may not hold for subjective tasks where a total ordering does not exist.
- Filtering out ~20% of the training data may be an unacceptable loss in low-resource settings where every preference pair matters.
Related Work & Insights¶
- Directly relevant to RLHF data cleaning — non-transitivity in preference data may be a source of reward hacking, and cleaning preference data could fundamentally improve reward model quality.
- vs. Xu et al. (2025): They identify non-transitivity in GPT-4 evaluations but propose no solution; ELSPR provides a systematic graph-theoretic remedy.
- vs. Traditional Data Denoising Methods: General-purpose methods lack structural priors specific to preference data; ELSPR leverages the mathematical properties of tournament graphs for targeted filtering.
- vs. Canoe (AAAI 2026): Canoe addresses faithfulness (models failing to follow context), while ELSPR addresses consistency of the evaluator itself — the two are complementary, with the former improving evaluated models and the latter improving evaluators.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The combination of graph theory, structural entropy, and preference data cleaning is highly original; SCC analysis precisely captures non-transitive preferences.
- Experimental Thoroughness: ⭐⭐⭐⭐ Cross-validation across 5 test sets, human validation, and cross-model validation; rigorously designed.
- Writing Quality: ⭐⭐⭐⭐ Formally rigorous; the derivation from tournament graphs to DAGs is clearly presented.
- Value: ⭐⭐⭐⭐⭐ Directly applicable to improving the reliability of LLM evaluation systems and the quality of RLHF data.