Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework

Conference: ACL 2025
arXiv: 2502.18874
Code: None
Area: Others
Keywords: LLM evaluation, multi-faceted analysis, code-driven evaluation, fine-tuned evaluator, instruction following

TL;DR

Proposes ARJudge, an evaluation framework in which a fine-tuned Analyzer adaptively generates evaluation criteria and carries out both text-based and code-driven analysis, while a training-free Refiner combines these analyses into a final judgment. ARJudge outperforms existing fine-tuned evaluators on multiple benchmarks; code-driven analysis in particular yields a gain of up to 11.1% on instruction-following evaluation.

Background & Motivation

Using LLMs to evaluate the output quality of other LLMs has become an important paradigm, but existing fine-tuned evaluators face two core issues:

Insufficient Adaptability of Predefined Criteria
- Existing methods (e.g., Auto-J, Prometheus) use fixed, general evaluation criteria (e.g., "conciseness", "logical structure").
- These criteria fail to cover tasks requiring specific evaluation dimensions, such as creative writing or professional domains.
- They generalize poorly to unseen instructions.

Instability in Evaluating Objective Constraints
- LLMs behave unreliably when evaluating quantitative requirements (e.g., word count limits) and structural constraints (e.g., formatting requirements).
- They struggle to accurately judge even basic text attributes (e.g., counting word frequency).
- Pure text analysis has inherent limitations in objective verification.

The authors argue that building a robust evaluator requires two capabilities: adaptive generation of evaluation criteria (what to evaluate) + multi-faceted analysis (how to evaluate), especially by integrating code tools to handle objective requirements.

Method

Overall Architecture

ARJudge consists of two components:
- Analyzer (Fine-tuned): Adaptively generates evaluation criteria and performs text-based as well as code-driven analysis.
- Refiner (Training-free): Synthesizes the multi-faceted analysis results from the Analyzer to make the final judgment.

The Analyzer's training data comes from a purpose-built Composite Analysis Corpus.

Key Designs

Composite Analysis Corpus Construction
- Evaluation Criteria Generation: Two types of questions
  - Type 1 (for text analysis): Given an instruction + 3 sample responses, an LLM is used to generate 3 evaluation questions.
  - Type 2 (for code analysis): Objective constraints (e.g., word count limits) are added to instructions via self-instruct, generating corresponding evaluation questions.
- Text Analysis Collection: The LLM is given the instruction, two responses, and the evaluation questions, and is asked to produce a comparative analysis; samples whose conclusions contradict the human annotations are filtered out.
- Code-driven Analysis Development:
  - Generates Python verification functions for objective evaluation questions using Claude-3.5-Sonnet.
  - The input to the function is the response text, and the output is the verification result.
  - Double filtering: execution testing + reverse verification (prompting the LLM to explain the code's purpose and check its consistency with the original question).
- Design Motivation: To organically combine "what to evaluate" and "how to evaluate" into a single training pipeline.
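
To make the code-driven branch concrete, here is a minimal sketch of what one generated verification function and the execution-testing filter might look like. The function names, the 50-word limit, and the sample texts are assumptions for illustration, not taken from the paper; the reverse-verification step needs an LLM call and is omitted.

```python
import re
from typing import Callable, List


def verify_word_limit(response: str, max_words: int = 50) -> bool:
    """Hypothetical generated checker: True if the response has at most max_words words."""
    words = re.findall(r"\b\w+\b", response)
    return len(words) <= max_words


def passes_execution_test(check: Callable[[str], bool], samples: List[str]) -> bool:
    """Keep a checker only if it runs on every sample and returns a bool."""
    for text in samples:
        try:
            result = check(text)
        except Exception:
            return False  # the checker crashed, so it is filtered out
        if not isinstance(result, bool):
            return False  # the checker returned something other than a verdict
    return True


def broken_checker(response: str) -> bool:
    """Deliberately faulty checker: str has no word_count attribute."""
    return response.word_count <= 50


samples = ["A short answer.", "word " * 80]
print(verify_word_limit(samples[0]))                      # True
print(verify_word_limit(samples[1]))                      # False
print(passes_execution_test(verify_word_limit, samples))  # True
print(passes_execution_test(broken_checker, samples))     # False
```

Counting words in code gives the same answer on every run, which is the property the text-only evaluators described earlier lack.
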
Analyzer Training
- Fine-tuned based on the Qwen2.5-7B-Instruct model.
- Approximately 25K training samples, containing three types of tasks: evaluation question generation, text-based analysis, and code generation.
- Employs different prompt templates to distinguish between the question generation and response analysis tasks.
- Text and code analyses share the same prompt template, with different modes triggered by starting prompts.
- Design Motivation: Multi-task training enables the model to learn to adaptively select the appropriate analysis paradigm.
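
One way to picture "same template, different starting prompt" is the sketch below. The template wording, the two starting prompts, and all names (`ANALYSIS_TEMPLATE`, `START_PROMPTS`, `build_analysis_prompt`) are hypothetical; the summary above does not give the actual prompts.

```python
from dataclasses import dataclass

# Shared analysis template (hypothetical wording).
ANALYSIS_TEMPLATE = (
    "Instruction: {instruction}\n"
    "Response A: {response_a}\n"
    "Response B: {response_b}\n"
    "Evaluation question: {question}\n"
    "Analysis:\n"
)

# Hypothetical starting prompts that steer the shared template toward one mode.
START_PROMPTS = {
    "text": "Comparing the two responses,",
    "code": "def evaluate(response):",
}


@dataclass
class AnalysisRequest:
    instruction: str
    response_a: str
    response_b: str
    question: str


def build_analysis_prompt(req: AnalysisRequest, mode: str) -> str:
    """Fill the shared template, then append the mode-specific starting prompt."""
    if mode not in START_PROMPTS:
        raise ValueError(f"unknown mode: {mode!r}")
    shared = ANALYSIS_TEMPLATE.format(
        instruction=req.instruction,
        response_a=req.response_a,
        response_b=req.response_b,
        question=req.question,
    )
    return shared + START_PROMPTS[mode]


req = AnalysisRequest(
    instruction="Describe autumn in at most 50 words.",
    response_a="Leaves fall.",
    response_b="Autumn is a season of change.",
    question="Does the response stay within 50 words?",
)
text_prompt = build_analysis_prompt(req, "text")
code_prompt = build_analysis_prompt(req, "code")
print(text_prompt.splitlines()[-1])  # Comparing the two responses,
print(code_prompt.splitlines()[-1])  # def evaluate(response):
```

The two prompts are identical up to the final line, so the model's continuation is conditioned only on the starting prompt when choosing between prose and code.
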
Refiner Comprehensive Judgment
- Runs the same Qwen2.5-7B-Instruct model zero-shot, with no fine-tuning.
- Takes all analytical results from the Analyzer as input.
- Instructs the Refiner to revisit the instruction requirements and synthesize judgments on which response is superior.
- Design Motivation: To preserve the general model's macro-evaluation capability and compensate for any potential biases of the Analyzer.
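
The Refiner step can be sketched as prompt assembly plus verdict parsing. Everything here is an assumption made for illustration: the prompt wording, the `Verdict: A/B` output convention, and the `generate` callable, which stands in for a zero-shot call to the untuned model.

```python
from typing import Callable, List


def build_refiner_prompt(instruction: str, response_a: str, response_b: str,
                         analyses: List[str]) -> str:
    """Concatenate the instruction, both responses, and all Analyzer outputs."""
    numbered = "\n".join(
        f"[Analysis {i}] {text}" for i, text in enumerate(analyses, start=1)
    )
    return (
        f"Instruction: {instruction}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        f"{numbered}\n"
        "Revisit the instruction requirements, weigh the analyses above, "
        "and end with 'Verdict: A' or 'Verdict: B'."
    )


def parse_verdict(output: str) -> str:
    """Return 'A' or 'B' from the last 'Verdict:' line, or 'unknown'."""
    for line in reversed(output.strip().splitlines()):
        line = line.strip()
        if line.startswith("Verdict:"):
            choice = line[len("Verdict:"):].strip().upper()
            if choice in ("A", "B"):
                return choice
    return "unknown"


def refine(generate: Callable[[str], str], instruction: str, response_a: str,
           response_b: str, analyses: List[str]) -> str:
    """Run one zero-shot Refiner pass and return the parsed verdict."""
    prompt = build_refiner_prompt(instruction, response_a, response_b, analyses)
    return parse_verdict(generate(prompt))


def fake_generate(prompt: str) -> str:
    """Stand-in for the untuned model so the sketch runs offline."""
    return "Response B exceeds the word limit.\nVerdict: A"


verdict = refine(
    fake_generate,
    "Describe autumn in at most 50 words.",
    "Leaves fall.",
    "word " * 80,
    ["Text analysis: A is terser.", "Code check: A passes, B fails the limit."],
)
print(verdict)  # A
```
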

Loss & Training

Uses standard supervised fine-tuning loss (next token prediction).
Generates analysis and judgments via greedy decoding (temperature=0).
Only the Analyzer is fine-tuned; the Refiner uses the identical model without fine-tuning.
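
In symbols, the next-token objective can be written as below. The notation is an assumption of these notes: \(\mathcal{D}\) is the training set, \(x\) the prompt, \(y\) the target sequence, and \(\theta\) the Analyzer parameters. Summing over target tokens only is a common SFT convention, not something stated above.

\[
\mathcal{L}(\theta) = -\sum_{(x, y) \in \mathcal{D}} \sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid x, y_{<t}\right)
\]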

Key Experimental Results

Main Results

Model	JudgeLM	PandaLM	Auto-J	MTBench	LLMBar	Average
GPT-4o	81.8	83.1	78.6	78.8	79.8	80.4
Claude-3.5-Sonnet	82.9	86.4	78.2	80.8	83.4	82.3
Auto-J-13B	77.9	77.2	79.7	75.0	27.8	67.5
Prometheus2-7B	76.5	76.3	75.1	74.3	41.5	68.7
ARJudge	81.0	82.4	78.5	78.3	68.2	77.7

ARJudge achieves the best performance among all fine-tuned evaluators. On LLMBar it improves accuracy by 26.7 percentage points over Prometheus2, the best fine-tuned baseline (41.5 to 68.2).
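
The Average column and the LLMBar gap can be recomputed directly from the table. The sketch below only copies the reported numbers into a dictionary; the variable names are ours.

```python
# Per-benchmark scores in table order: JudgeLM, PandaLM, Auto-J, MTBench, LLMBar.
scores = {
    "GPT-4o": [81.8, 83.1, 78.6, 78.8, 79.8],
    "Claude-3.5-Sonnet": [82.9, 86.4, 78.2, 80.8, 83.4],
    "Auto-J-13B": [77.9, 77.2, 79.7, 75.0, 27.8],
    "Prometheus2-7B": [76.5, 76.3, 75.1, 74.3, 41.5],
    "ARJudge": [81.0, 82.4, 78.5, 78.3, 68.2],
}

# Unweighted mean over the five benchmarks, rounded as in the table.
averages = {model: round(sum(v) / len(v), 1) for model, v in scores.items()}

# LLMBar is the last column; gap between ARJudge and Prometheus2-7B.
llmbar_gap = round(scores["ARJudge"][-1] - scores["Prometheus2-7B"][-1], 1)

for model, avg in averages.items():
    print(f"{model}: {avg}")
print(f"LLMBar gap vs Prometheus2-7B: {llmbar_gap}")  # 26.7
```
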

Ablation Study

Setting	JudgeLM	PandaLM	Auto-J	MTBench	LLMBar
Qwen2.5-7B (Baseline)	80.0	80.7	73.8	75.2	52.6
ARJudge	81.0	82.4	78.5	78.3	68.2
-w/o Fine-tuning	73.1	75.6	68.7	70.0	62.5
-w/o Fine-tuning & Multi-faceted	74.7	72.2	65.6	67.8	63.7
-w/o Refine	81.7	82.8	79.6	79.1	63.7

Fine-tuning is the most critical component: removing it lowers accuracy on every benchmark, by 5.7 to 9.8 points. The Refiner mainly helps on the challenging LLMBar samples (+4.5 points, 63.7 to 68.2), at a cost of 0.4 to 1.1 points on the other four benchmarks.
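
The per-benchmark effect of each ablation is easier to read as a difference from the full system. This sketch recomputes those deltas from the ablation table; only the reported numbers are used, and the helper name `delta_vs_full` is ours.

```python
BENCHMARKS = ["JudgeLM", "PandaLM", "Auto-J", "MTBench", "LLMBar"]

full = [81.0, 82.4, 78.5, 78.3, 68.2]  # ARJudge (full system)
ablations = {
    "w/o Fine-tuning": [73.1, 75.6, 68.7, 70.0, 62.5],
    "w/o Refine": [81.7, 82.8, 79.6, 79.1, 63.7],
}


def delta_vs_full(variant_scores):
    """Ablated score minus full-system score; negative means the ablation hurts."""
    return {
        name: round(v - f, 1)
        for name, v, f in zip(BENCHMARKS, variant_scores, full)
    }


deltas = {variant: delta_vs_full(s) for variant, s in ablations.items()}
for variant, d in deltas.items():
    print(variant, d)
# w/o Fine-tuning {'JudgeLM': -7.9, 'PandaLM': -6.8, 'Auto-J': -9.8, 'MTBench': -8.3, 'LLMBar': -5.7}
# w/o Refine {'JudgeLM': 0.7, 'PandaLM': 0.4, 'Auto-J': 1.1, 'MTBench': 0.8, 'LLMBar': -4.5}
```
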

Key Findings

Effectiveness of Code-Driven Analysis: On IFEval, code-driven analysis is far more consistent than pure text-based analysis, and the generated code executes successfully 100% of the time.
Effect of Multi-Turn Analysis: Increasing the number of analysis turns has a positive effect on conventional test sets but may amplify uncertainty on challenging datasets like LLMBar.
Double-Edged Sword of Refinement: Under a fine-tuned Analyzer, the Refiner maintains performance, whereas under a non-fine-tuned Analyzer, the Refiner actually degrades performance (failure in self-correction).
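Significantly Surpassing the Backbone Model: Relative to the Qwen2.5-7B baseline, ARJudge gains 15.6 points on LLMBar (52.6 to 68.2); averaged over the five benchmarks in the ablation table, the gain is about 5.2 points (72.5 to 77.7).
Generalization on LLMBar: On the adversarially designed LLMBar, ARJudge reaches 68.2 against 80.4 for DeepSeek-V3, leaving a 12.2-point gap while clearly leading the other fine-tuned evaluators.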

Highlights & Insights

Unified Framework of "What and How to Evaluate": Rather than predefining criteria, the model adaptively generates evaluation dimensions.
Code as an Evaluation Tool: Introducing code execution into the evaluation pipeline overcomes the unreliability of LLMs in objective judgment.
Analyzer-Refiner Division of Labor: Combines fine-tuned expert analysis with untuned general synthesis, balancing both depth and breadth.
Reverse Verification Mechanism: Employs a two-step "explanation \(\rightarrow\) verification" process after code generation to ensure alignment between the code and evaluation objectives.

Limitations & Future Work

Limited to pairwise comparison evaluation; does not support scoring individual responses.
Tool usage is restricted to Python code, without considering other validation tools such as search engines or knowledge bases.
The Refiner relies heavily on the LLM's own reasoning capabilities; improvements are limited if the base capacity is insufficient.
Training data construction depends on GPT-4o and Claude, introducing cascading bias.
Code generation is confined to objective constraint verification, unable to handle subjective quality evaluations.

Compared to Auto-J's predefined criteria + text analysis, ARJudge introduces two additional dimensions: adaptive criteria generation and code analysis.
Compared to Prometheus's fixed evaluation templates, ARJudge's multi-faceted analysis is more flexible.
Insight: Code-driven analysis can be extended to more scenarios (e.g., integrating search engines for fact-checking, or calculators for mathematical reasoning).

Rating

Novelty: ⭐⭐⭐⭐ — Code-driven objective evaluation analysis offers a novel perspective, and the Analyzer-Refiner architecture has a well-structured hierarchy.
Experimental Thoroughness: ⭐⭐⭐⭐ — The evaluation is comprehensive, spanning 5 datasets, various baselines, ablation studies, and case studies.
Writing Quality: ⭐⭐⭐⭐ — The corpus construction process is clear and figures/tables are rich, though some sections are somewhat verbose.
Value: ⭐⭐⭐⭐ — Provides a practical framework for building more reliable LLM evaluators; the code-driven analysis approach has high generalizability.