
ConsistencyChecker: Tree-based Evaluation of LLM Generalization Capabilities¶

Conference: ACL 2025
arXiv: 2506.12376
Code: https://github.com/ulab-uiuc/consistencychecker
Area: LLM/NLP
Keywords: LLM Evaluation, Self-Consistency, tree-based benchmark, benchmark leakage, round-trip transformation, Machine Translation, Code Generation

TL;DR¶

ConsistencyChecker is a reference-free LLM evaluation framework built on self-consistency trees. It constructs tree-structured, multi-step paths of reversible transformations (such as multilingual round-trip translation and equivalent code rewriting) and quantifies how well a model preserves semantics or functionality across repeated transformations. Because the benchmarks are generated dynamically, benchmark leakage is avoided at the source. The framework's scores correlate strongly (\(r > 0.7\)) with the authoritative WMT 2024 rankings, suggesting that LLM generalization can be evaluated reliably without paired data.

Background & Motivation¶

Severe benchmark leakage: LLM training datasets are massive and opaque. Prior studies have found that evaluation data in mainstream benchmarks like HumanEval overlaps with training data, leading to artificially inflated evaluation scores and untrustworthy rankings.

Limitations of traditional self-consistency methods: Existing self-consistency sampling only evaluates the consistency of multiple outputs from a single prompt, so it fails to capture the cumulative semantic drift across multi-step transformations, such as meaning shifts in translation or functional degradation in code.

Hard to scale fixed benchmarks: Relying on manual annotation or crawled data is costly and leads to narrow coverage. Low-resource languages and specialized domains often lack high-quality evaluation data.

Lack of evaluation methods for multi-step transformations: Multi-turn interaction scenarios (such as multilingual translation pipelines or iterative code generation) require evaluating consistency across multi-step operations, but most existing frameworks focus only on single-step output quality.

High evaluation cost: Authoritative evaluations like WMT rely on large-scale parallel corpora and manual annotations, which involve high computational overhead and deployment barriers, making frequent use difficult.

Insufficient generalizability: Existing evaluation methods are typically designed for specific tasks (e.g., BLEU for translation, pass@k for code) and lack a unified framework capable of covering both semantic and functional consistency.

Method¶

Overall Architecture¶

The core of the ConsistencyChecker framework is the construction of a Self-consistency Tree. The framework defines two roles: the evaluator (the model generating dynamic benchmark data) and the evaluatee (the model being evaluated). The evaluator generates the root node (initial text or code), and the evaluatee performs a series of reversible transformations on the root node to form a tree structure. By comparing the similarity between nodes at different depths of the tree and the root node, the model's consistency preservation capability is quantified.

Core Concept Definitions¶

Operation: A transformation \(f_p\) driven by a prompt and its inverse transformation \(f_{p'}\), where ideally \(f_{p'}(f_p(c)) \approx c\). For example, "English-to-French" and "French-to-English" form a pair of reversible operations.
Node: A tuple \(v = (c, \mathcal{I})\), where \(c\) is the generated content (text or code) and \(\mathcal{I}\) is the set of test inputs. In translation tasks, \(\mathcal{I}\) is an empty set; in programming tasks, it contains 20 sets of test cases.
Edge: Connects two nodes that share the same test inputs, representing the execution of a pair of reversible operations, i.e., \(c_j = f_{p'}(f_p(c_i))\).
Self-consistency Tree: \(\mathcal{T} = (\mathcal{V}, \mathcal{E})\), where the root node is the initial state generated by the evaluator, and each layer expands branches through all operation pairs, with the branching factor equal to the number of operation pairs.
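
The definitions above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the names `Node`, `expand`, and `nodes_at_depth` are hypothetical, and the two toy operation pairs stand in for LLM calls made by the evaluatee.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical type: a transformation maps content to content (in practice, an LLM call).
Transform = Callable[[str], str]


@dataclass
class Node:
    content: str                        # c: generated text or code
    test_inputs: Tuple[str, ...] = ()   # I: empty for translation, test cases for code
    depth: int = 0
    children: List["Node"] = field(default_factory=list)


def expand(node: Node, op_pairs: List[Tuple[Transform, Transform]], max_depth: int) -> None:
    """Grow the tree: each edge applies one forward/inverse operation pair."""
    if node.depth >= max_depth:
        return
    for forward, inverse in op_pairs:
        # c_j = f_p'(f_p(c_i)); the child shares the parent's test inputs.
        child = Node(inverse(forward(node.content)), node.test_inputs, node.depth + 1)
        node.children.append(child)
        expand(child, op_pairs, max_depth)


def nodes_at_depth(root: Node, depth: int) -> List[Node]:
    """Collect all nodes at the given depth, in branch order."""
    if root.depth == depth:
        return [root]
    return [n for child in root.children for n in nodes_at_depth(child, depth)]


# Toy stand-ins for an evaluatee: a lossless pair and a lossy pair.
lossless = (str.upper, str.lower)
lossy = (lambda s: s[:-1], lambda s: s)  # drops one character per round trip

root = Node("hello world")
expand(root, [lossless, lossy], max_depth=3)
leaves = nodes_at_depth(root, 3)
```

With two operation pairs and depth 3, the tree has 2^3 = 8 nodes at the deepest level; the all-lossless path returns the root content unchanged, while every lossy edge removes one character.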

Consistency Scoring System (Four-level Progression)¶

Node Pair Similarity \(\text{sim}(v_i, v_j)\): The cosine similarity between NV-Embed-v2 embeddings of the two nodes' execution outputs, or alternatively the BLEU score between them.
Path-level Consistency \(C(P) = \text{sim}(v_1, v_n)\): The end-to-end similarity between the starting and ending nodes of a path, measuring the degree of information retention after \(n\) transformation steps.
Tree-level Consistency \(C_n(\mathcal{T})\): The average of all path consistencies at a given depth \(n\).
Forest-level Consistency \(C_n(\mathcal{F}) = \frac{1}{M}\sum_{m=1}^{M} C_n(\mathcal{T}_m)\): The average across \(M\) trees. The final metric is taken at \(n=3, M=10\).
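
The four scoring levels can be illustrated with a short sketch. It assumes a toy Jaccard token overlap in place of NV-Embed-v2 embeddings or BLEU, and the function names and the dictionary layout of a tree are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List

Similarity = Callable[[str, str], float]


def jaccard(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of whitespace tokens (stand-in for embeddings or BLEU)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def tree_consistency(root: str, end_nodes: List[str], sim: Similarity) -> float:
    """C_n(T): mean of the path scores sim(root, v) over the end nodes v of all depth-n paths."""
    return mean(sim(root, v) for v in end_nodes)


def forest_consistency(trees: List[Dict], sim: Similarity) -> float:
    """C_n(F): mean of the tree-level scores over the M trees in the forest."""
    return mean(tree_consistency(t["root"], t["leaves"], sim) for t in trees)


# Two toy trees, each with two depth-n end nodes.
forest = [
    {"root": "the cat sat on the mat",
     "leaves": ["the cat sat on the mat", "a cat sat on a mat"]},
    {"root": "rain fell all night",
     "leaves": ["rain fell all night", "rain fell all night"]},
]
score = forest_consistency(forest, jaccard)
```

Because each level is a plain average of the level below it, any similarity function with the same signature can be swapped in without changing the aggregation.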

Evaluation Task Design¶

Machine Translation: The root node is a 400-word English paragraph generated by the evaluator. Operation pairs are round-trip translations between English and French/Spanish/German (3 pairs of operations, tree branching factor = 3).
AI-assisted Programming: The root node consists of a LeetCode-Hard level coding problem generated by the evaluator along with 20 sets of test inputs. The operation is equivalent code rewriting (e.g., "replacing multiplication with loop summation" and its inverse operation). The execution time limit is 2 seconds per test case.
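
For the programming task, functional consistency comes down to comparing what two versions of a program return on shared test inputs. The sketch below is illustrative only: it uses exact output matching rather than embedding similarity, does not enforce the 2-second time limit, and all function names are hypothetical.

```python
from typing import Any, Callable, Sequence


def multiply(a: int, b: int) -> int:
    """Original solution: built-in multiplication."""
    return a * b


def multiply_by_summation(a: int, b: int) -> int:
    """Rewritten solution: multiplication replaced by loop summation."""
    total = 0
    for _ in range(abs(b)):
        total += a
    return total if b >= 0 else -total


def output_agreement(f: Callable[..., Any], g: Callable[..., Any],
                     test_inputs: Sequence[tuple]) -> float:
    """Fraction of test inputs on which both versions return the same output."""
    matches = 0
    for args in test_inputs:
        try:
            matches += f(*args) == g(*args)
        except Exception:
            pass  # a crash in either version counts as a mismatch
    return matches / len(test_inputs)


test_inputs = [(3, 4), (0, 7), (-2, 5), (6, -3), (-4, -4)]
agreement = output_agreement(multiply, multiply_by_summation, test_inputs)
```

A faithful rewrite keeps the agreement at 1.0 on every input; a rewrite that changes behavior would lower it, which is the kind of functional drift the tree is meant to surface.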

Experiments¶

Experimental Setup¶

Evaluator: Qwen-2.5-72B (generates 10 root nodes, shared across all evaluatees)
Evaluatee: GPT-4o-mini, Qwen-2.5 (1.5B/7B/14B/32B/72B), LLaMA-3.1 (8B/70B), totaling 8 models
Similarity Metric: NV-Embed-v2 embedding cosine similarity (main experiments); BLEU (discussion section)
ConsistencyChecker Configuration: \(n=3, M=10\)

Table 1: Main Results (ConsistencyChecker Scores)¶

Model	Translation Task ↑	Programming Task ↑
GPT-4o-mini	98.0±0.0	76.5±2.7
Qwen-2.5-1.5B	80.3±0.5	63.4±1.4
Qwen-2.5-7B	90.0±0.8	71.7±0.4
Qwen-2.5-14B	94.7±0.1	79.9±1.0
Qwen-2.5-32B	96.4±0.0	85.1±1.1
Qwen-2.5-72B	97.2±0.0	77.0±1.9
LLaMA-3.1-8B	67.5±3.0	60.4±1.0
LLaMA-3.1-70B	71.9±3.2	83.5±1.0

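GPT-4o-mini performs best in the translation task; Qwen-2.5-32B performs best in the programming task. Within a model series, larger models generally score higher, most clearly in translation (Qwen-2.5-72B scores about 21% higher than Qwen-2.5-1.5B in relative terms, 97.2 vs. 80.3). The programming task has one exception: Qwen-2.5-72B (77.0) scores below both the 14B and 32B models.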

Table 2: Comparison with Authoritative WMT 2024 Metrics (Czech-Ukrainian Pair)¶

Model	ConsistencyChecker ↑	CometKiwi ↑	AutoRank ↓
Claude-3.5-Sonnet	98.1	68.3	1.7
GPT-4	96.4	67.7	2.0
Gemini-1.5-Pro	97.5	66.8	2.0
Mistral-Large	96.5	66.6	2.3
LLaMA-3-70B	95.9	66.1	2.6
Phi-3-Medium	44.9	42.5	9.1

The Pearson correlation coefficient exceeds 0.8 in magnitude (AutoRank is lower-is-better, so its correlation with ConsistencyChecker is negative), indicating that ConsistencyChecker produces rankings highly consistent with authoritative metrics without using any paired WMT data.
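
The correlation can be recomputed from the six rows of the table above. This is a plain-Python sketch of the standard Pearson formula applied to those rows only; the paper's own figure may be based on more data.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Columns of the table, in row order (Claude-3.5-Sonnet ... Phi-3-Medium).
consistency = [98.1, 96.4, 97.5, 96.5, 95.9, 44.9]
comet_kiwi = [68.3, 67.7, 66.8, 66.6, 66.1, 42.5]
auto_rank = [1.7, 2.0, 2.0, 2.3, 2.6, 9.1]

r_comet = pearson(consistency, comet_kiwi)  # higher-is-better vs. higher-is-better
r_auto = pearson(consistency, auto_rank)    # AutoRank is lower-is-better, so r is negative
```

On these rows both coefficients are well above 0.8 in magnitude. With only six points, the Phi-3-Medium row contributes heavily to that value.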

Table 3: Results of Using BLEU as a Similarity Metric (Evaluator: Qwen-2.5-72B)¶

Model	Translation \(C_1\)	Translation \(C_2\)	Translation \(C_3\)	Programming \(C_1\)	Programming \(C_3\)
GPT-4o-mini	86.0	78.6	68.0	78.4	48.3
Qwen-2.5-32B	78.4	68.2	57.3	79.2	64.3
LLaMA-3.1-8B	59.5	47.6	36.6	69.4	11.7

The Pearson correlation coefficient between the BLEU-based and embedding-based scores in the programming task is as high as 0.98-0.99, which suggests that BLEU is an unexpectedly effective similarity metric for code.
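
The BLEU table also shows how quickly consistency decays with depth. As a small worked example, the relative drop from \(C_1\) to \(C_3\) can be computed from the rows above; the helper name `relative_drop` is hypothetical.

```python
# BLEU-based scores from the table above: (C_1, C_3) per task.
bleu = {
    "GPT-4o-mini":  {"translation": (86.0, 68.0), "programming": (78.4, 48.3)},
    "Qwen-2.5-32B": {"translation": (78.4, 57.3), "programming": (79.2, 64.3)},
    "LLaMA-3.1-8B": {"translation": (59.5, 36.6), "programming": (69.4, 11.7)},
}


def relative_drop(c_start: float, c_end: float) -> float:
    """Fraction of the depth-1 score lost by depth 3."""
    return (c_start - c_end) / c_start


drops = {
    model: {task: relative_drop(*scores) for task, scores in tasks.items()}
    for model, tasks in bleu.items()
}

for model, per_task in drops.items():
    print(model, {task: f"{100 * d:.1f}%" for task, d in per_task.items()})
```

Under BLEU, the weakest model in this table loses most of its programming score between depth 1 and depth 3, while the strongest programming model loses under a fifth.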

Key Findings¶

Scale correlates with consistency: Within the same model series, larger parameter sizes generally yield higher consistency scores. In translation, the Qwen-2.5 series increases monotonically (80.3 to 97.2); in programming, it rises through 32B (85.1), after which the 72B model drops to 77.0.
Consistency decays with depth: When the path length increases from 1 to 3, consistency drops for almost all models; LLaMA-3.1-8B drops by 24.6% in the translation task, while Qwen-2.5-72B drops by only 1.9%.
Inconsistent rankings across translation and programming: GPT-4o-mini is best at translation but ranks fifth in programming, while Qwen-2.5-32B is strongest in programming but ranks third in translation, indicating that consistency is task-dependent.
Limited impact of the evaluator model: Using Qwen-2.5-7B or 72B as the evaluator yields virtually identical relative rankings, indicating that the framework is robust to the choice of evaluator.
BLEU is unexpectedly effective in programming tasks: Since code is typically either fully correct or entirely incorrect, n-gram overlap is sufficient for differentiation.
Weak models expose structural failures: Qwen-2.5-7B exhibits severe degradation in the French translation branch (similarity of only 0.33, output shortened to ~40 tokens); the tree structure effectively isolates such failure modes.

Highlights & Insights¶

Reliable evaluation without reference data: Without using any parallel WMT data, the framework's scores reach a Pearson correlation above 0.8 with CometKiwi and AutoRank, suggesting that self-consistency is a reliable proxy metric for LLM capability.
Dynamic benchmarks resolve leakage at the source: Test data is generated on-the-fly by the evaluator LLM for each evaluation, theoretically allowing an infinite stream of fresh cases, fundamentally preventing training set contamination.
Tree structure provides diagnostic capabilities: Beyond outputting a scalar score, comparing branches exposes exactly where the model fails (e.g., "poor French translation but good German translation"), a diagnostic capability absent in single-score benchmarks.
Unified framework covering both semantics and functionality: Translation (semantic consistency) and code (functional consistency) are unified into "nodes as functions, edges as reversible operations," which can easily generalize to other reversible tasks like summarizing/expanding or compressing/decompressing.

Limitations & Future Work¶

Limited task coverage: Only machine translation and code generation tasks were evaluated, omitting other reversible tasks like arithmetic reasoning or summarization.
Insufficient model diversity: Only 8 models across 3 series were tested, lacking comparisons with recent models like Gemma and Mixtral.
Purely automatic metrics: Heavy reliance on embedding similarity and BLEU without incorporating human evaluation, which might miss creative nuances or fine-grained quality differences.
Applicable only to reversible transformations: The core premise relies on reversible operations, hindering direct evaluation of open-ended generation or creative writing, which lack clear inverse mappings.
Exponential complexity growth with depth: With a branching factor \(k\), a tree of depth \(n\) has \(k^n\) leaf nodes, making large-scale evaluations computationally expensive.
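
The growth in cost can be made concrete with a short calculation. It assumes each edge costs two evaluatee calls (one forward, one inverse transformation) and ignores root generation and similarity computation; the function name `tree_cost` is hypothetical.

```python
def tree_cost(k: int, n: int) -> dict:
    """Count leaves, edges, and evaluatee calls for branching factor k and depth n."""
    leaves = k ** n
    edges = sum(k ** d for d in range(1, n + 1))  # k + k^2 + ... + k^n
    return {"leaves": leaves, "edges": edges, "model_calls": 2 * edges}


# Translation setup from this summary: 3 operation pairs, depth 3, 10 trees.
per_tree = tree_cost(k=3, n=3)
forest_calls = 10 * per_tree["model_calls"]

# One extra level of depth roughly triples the cost for k = 3.
deeper = tree_cost(k=3, n=4)
```

Under these assumptions, one tree at \(k=3, n=3\) has 27 leaves and 39 edges, so a 10-tree forest needs 780 evaluatee calls; going to depth 4 raises a single tree to 120 edges.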

Related Work¶

Round-trip translation consistency: van Zaanen & Zwarts (2006) proposed round-trip translation quality detection, but it was limited to a single round-trip with a single language pair. ConsistencyChecker generalizes this to tree-like, multi-path, multi-depth, cross-task setups.
Consistency evaluation frameworks: Divide-Conquer-Reasoning (Cui et al., 2024) and MT-Eval (Kwan et al., 2024) improve consistency via decomposed tasks and dynamic benchmarks, yet they lack tracking of multi-step transformation chains.
Formal verification: Traditional model checking and SMT solvers require domain specifications and massive computation. ConsistencyChecker approximates functional verification using a set of test inputs, offering a more lightweight alternative.
Code evaluation benchmarks: HumanEval (Chen et al., 2021) and CodeXGLUE (Lu et al., 2021) rely on fixed datasets, risking leakage. The dynamic generation approach in this work bypasses this limitation.
WMT evaluation systems: CometKiwi and MetricX require extensive human-annotated parallel data, whereas ConsistencyChecker achieves a ranking correlation of \(r > 0.8\) at near-zero data cost.

Rating¶

Novelty: ⭐⭐⭐⭐ Tree-based evaluation of reversible transformations is a new paradigm that generalizes round-trip consistency from a single path to tree and forest structures.
Experimental Thoroughness: ⭐⭐⭐⭐ Tested with 8 models, two task types, compared against WMT benchmarks, with path-length/evaluator ablations and BLEU vs. embedding discussions.
Writing Quality: ⭐⭐⭐⭐ Rigorous conceptual definitions progressing from nodes to forests, paired with clear and intuitive diagrams.
Value: ⭐⭐⭐⭐ Offers a fundamental remedy to the benchmark leakage problem; the unified framework holds promise for extension to more task domains.