Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework
Conference: ACL 2025 (workshop GEM²)
arXiv: 2410.18653
Code: GitHub
Area: Text Generation
Keywords: open-ended text generation, multicriteria evaluation, decoding strategies, Bradley-Terry model, text quality metric
TL;DR
To address the trade-off among multiple metrics (coherence/diversity/perplexity) in open-ended text generation, this paper proposes three complementary multi-criteria evaluation methods: the Extended Bradley-Terry model (ordinal ranking), Union-Free Generic Depth (partial ordering allowing incomparability), and Q*Text (cardinal comprehensive evaluation metric). Validated on over 1.8 million generated texts across 6 LLMs, 59 decoding strategies, and 3 datasets, the results show that moderate hyperparameter configurations generally outperform extreme ones, and smaller models with appropriate decoding strategies can match the performance of larger models.
Background & Motivation
Background: The output quality of LLMs depends not only on the model architecture but also on the decoding strategies used during inference (such as beam search, top-k/top-p sampling, and contrastive search). Existing evaluation methods largely rely on a single metric or human judgment.
Limitations of Prior Work: Decoding strategies inherently involve trade-offs among multiple metrics—optimizing coherence often sacrifices diversity, and vice versa. Evaluation using single metrics yields one-sided conclusions. Existing aggregation methods, such as the Pareto front, are uninformative for large-scale benchmarking, while weighted sums rely on arbitrary weight choices.
Key Challenge: How to establish a principled aggregation method among multiple conflicting automatic metrics to provide reliable rankings or scores for decoding strategies?
Goal: (a) Given multiple metrics, how to establish an ordinal ranking (allowing for incomparability)? (b) How to design a statistically grounded comprehensive metric for cardinal evaluation?
Key Insight: Distinguish between two practical scenarios—Scenario 1 (ranking only is needed \(\rightarrow\) use partial order theory) and Scenario 2 (quantifying differences is needed \(\rightarrow\) design a comprehensive metric)—each corresponding to a different approach.
Core Idea: Introduce depth functions from partial order theory and the Bradley-Terry model from statistics to text generation evaluation, combined with the proposed Q*Text comprehensive metric using a Gaussian penalty function to balance extreme values.
Method
Overall Architecture
Input: 6 LLMs (GPT2-XL to Falcon2-11B) \(\times\) 59 decoding configurations (5 decoding strategies with varying hyperparameters), i.e. 354 model-configuration pairs (the "decoding methods"), applied to 3 datasets \(\rightarrow\) 1.8M+ generated texts. Each text is evaluated on three metrics: coherence, diversity, and generation perplexity. Output: rankings or scores of the decoding methods.
Key Designs
- Extended Bradley-Terry Model (Scenario 1: Ranking)
  - Function: Establishes a total-order ranking of decoding methods from pairwise comparisons.
  - Mechanism: For each prompt, the 354 decoding methods are compared pairwise: one method wins if it is not worse than the other on all three metrics; otherwise the pair counts as a tie. A GLM with a Poisson distribution estimates the worth parameter \(\pi_i\) of each method, with \(P(i > j) = \pi_i / (\pi_i + \pi_j + \nu\sqrt{\pi_i\pi_j})\), where \(\nu \ge 0\) controls how much probability mass goes to ties.
  - Design Motivation: Computationally efficient with \(O(n^2m)\) complexity, so it scales well to large datasets; however, forcing a total order may oversimplify the relationships between methods.
- Union-Free Generic (UFG) Depth (Scenario 1: Partial Order Ranking)
  - Function: Preserves partial-order rankings that allow for incomparability.
  - Mechanism: The pairwise comparisons generated by each prompt are treated as one partial-order observation, and a depth function measures the "centrality" of each partial order. The partial order with the highest depth is the ranking structure most supported by the data.
  - Design Motivation: Does not assume independence between comparisons and allows for incomparability between methods, though its worst-case computational complexity is \(O(2^m)\).
  - Key Findings: Among the top four methods, the partial order with the highest depth (0.977) is the one in which all four are mutually incomparable.
- Q*Text (Scenario 2: Cardinal Evaluation)
  - Function: Aggregates coherence, diversity, and perplexity into a single comprehensive score.
  - Mechanism: \(\text{Q*Text} = \frac{\sum_{i=1}^3 w_i M_i P_i(M_i)}{\sum_{i=1}^3 w_i}\), where \(M_i\) is the value of metric \(i\) and \(P_i(x) = \exp(-\alpha_i(x-\mu_i)^2)\) is a Gaussian penalty function that penalizes large deviations from the optimal target \(\mu_i\).
  - Parameter Optimization: The 9 parameters (a weight \(w_i\), a target \(\mu_i\), and a penalty strength \(\alpha_i\) for each of the three metrics) are optimized by maximizing the Spearman correlation \(\rho_s\) with human ratings.
  - Design Motivation: The Gaussian penalty keeps degenerate outputs (such as the extremely low diversity of beam search, or gibberish generation) from scoring well and balances the metrics automatically.
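The Q*Text formula above can be sketched in a few lines of Python. This is only an illustration of the stated formula, not the authors' implementation: the weights, targets, and penalty strengths below are hypothetical placeholders (the paper fits them to human ratings), and the metric values are assumed to be pre-normalized to \([0, 1]\), so the result is not on the same scale as the scores reported in the tables.

```python
import math


def gaussian_penalty(x, mu, alpha):
    """P(x) = exp(-alpha * (x - mu)^2): equals 1 at the target mu, decays away from it."""
    return math.exp(-alpha * (x - mu) ** 2)


def qtext(metrics, weights, mus, alphas):
    """Weighted mean of M_i * P_i(M_i), following the formula in the text."""
    if not (len(metrics) == len(weights) == len(mus) == len(alphas)):
        raise ValueError("all parameter sequences must have the same length")
    numerator = sum(
        w * m * gaussian_penalty(m, mu, a)
        for w, m, mu, a in zip(weights, metrics, mus, alphas)
    )
    return numerator / sum(weights)


# Hypothetical parameters (NOT the fitted values from the paper).
WEIGHTS = (1.0, 1.0, 1.0)
MUS = (0.5, 0.5, 0.5)
ALPHAS = (10.0, 10.0, 10.0)

# Metric order assumed here: (coherence, diversity, perplexity), each in [0, 1].
balanced = qtext((0.5, 0.5, 0.5), WEIGHTS, MUS, ALPHAS)
# A repetitive output: very high coherence but almost no diversity.
degenerate = qtext((0.95, 0.02, 0.5), WEIGHTS, MUS, ALPHAS)

print(f"balanced:   {balanced:.3f}")
print(f"degenerate: {degenerate:.3f}")
```

With these placeholder parameters the degenerate profile scores lower than the balanced one, because both its very high coherence and its very low diversity lie far from their targets.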
Key Experimental Results
Bradley-Terry Ranking (WikiText-103)
| Rank | Decoding Method | Worth Parameter |
|---|---|---|
| 1 | Mistral-7B CS (α=0.6, k=15) | 0.0469 |
| 2 | Mistral-7B CS (α=0.4, k=3) | 0.0374 |
| 3 | Mistral-7B CS (α=0.8, k=3) | 0.0346 |
| Worst | GPT2-XL CS (α=1.0, k=20) | Lowest |
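To make the worth parameters concrete, the sketch below converts two of them into win and tie probabilities using the tie-extended formula from the Method section. Since the three outcomes must sum to one and the win formula is symmetric in \(i\) and \(j\), the tie probability is \(\nu\sqrt{\pi_i\pi_j} / (\pi_i + \pi_j + \nu\sqrt{\pi_i\pi_j})\). The value of \(\nu\) is not given in this summary, so `NU` is a hypothetical placeholder and the printed probabilities are illustrative only.

```python
import math


def pairwise_probs(pi_i, pi_j, nu):
    """Return (P(i beats j), P(j beats i), P(tie)) under the tie-extended model."""
    if pi_i <= 0 or pi_j <= 0 or nu < 0:
        raise ValueError("worths must be positive and nu non-negative")
    tie_term = nu * math.sqrt(pi_i * pi_j)
    denom = pi_i + pi_j + tie_term
    return pi_i / denom, pi_j / denom, tie_term / denom


NU = 1.0  # hypothetical tie parameter, NOT taken from the paper

# Worth parameters of the rank-1 and rank-2 methods in the table above.
p_win, p_lose, p_tie = pairwise_probs(0.0469, 0.0374, NU)
print(f"P(rank 1 beats rank 2) = {p_win:.3f}")
print(f"P(rank 2 beats rank 1) = {p_lose:.3f}")
print(f"P(tie)                 = {p_tie:.3f}")
```

Setting `nu=0` recovers the classical Bradley-Terry model without ties, where \(P(i > j) = \pi_i / (\pi_i + \pi_j)\).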
Q*Text Case Analysis
| Decoding Method | Q*Text Score | Description |
|---|---|---|
| Human reference text | 87.33 | Human baseline |
| GPT2-XL CS (0.6, 5) | 86.69 | Small model + reasonable parameters \(\approx\) human |
| Mistral CS (0.4, 10) | 81.62 | Large model with moderate configuration |
| GPT2-XL CS (1.0, 20) | 0.02 | Extreme parameters \(\rightarrow\) degenerative gibberish |
| Llama3 beam (3) | 0.02 | beam search \(\rightarrow\) repetitive degradation |
Key Findings
- Contrastive Search with moderate parameters (\(\alpha = 0.4\) to \(0.6\), \(k = 5\) to \(15\)) is generally optimal—achieving the best balance between coherence and diversity.
- Beam Search is almost always the worst—its extremely low diversity leads to severe penalties from Q*Text.
- Small model + good strategy > Large model + poor strategy: GPT2-XL (1.5B) combined with CS(0.6,5) yields a Q*Text of 86.69, which is close to the human score of 87.33.
- The top-4 methods are actually incomparable: UFG depth reveals that the total order ranking of Bradley-Terry might be "forced."
- Stochastic methods prefer high diversity configurations: temperature \(\tau > 0.7\), top-k \(k > 10\), nucleus \(p > 0.8\).
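The incomparability finding is easier to see with a small example of component-wise comparison. The sketch below is a minimal illustration, not the paper's UFG depth computation: it only builds the partial order for a single prompt. The method names and scores are hypothetical, all metrics are assumed to be oriented so that higher is better, and requiring a strict improvement on at least one metric is an assumption of this sketch.

```python
from itertools import combinations


def dominates(a, b):
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def partial_order(scores):
    """Split method pairs into ordered pairs (better, worse) and incomparable pairs."""
    ordered, incomparable = set(), set()
    for m1, m2 in combinations(sorted(scores), 2):
        if dominates(scores[m1], scores[m2]):
            ordered.add((m1, m2))
        elif dominates(scores[m2], scores[m1]):
            ordered.add((m2, m1))
        elif scores[m1] != scores[m2]:
            # Each method wins on some metric: neither can be ranked above the other.
            incomparable.add((m1, m2))
        # Identical score vectors are equivalent and fall into neither set.
    return ordered, incomparable


# Hypothetical per-prompt scores: (coherence, diversity, perplexity-based score).
scores = {
    "A": (0.8, 0.4, 0.6),
    "B": (0.6, 0.7, 0.5),
    "C": (0.5, 0.3, 0.4),
}
ordered, incomparable = partial_order(scores)
print("ordered pairs (better, worse):", sorted(ordered))
print("incomparable pairs:", sorted(incomparable))
```

Here A and B both dominate C, but A is more coherent while B is more diverse, so the pair (A, B) stays incomparable. A total-order method has to break such pairs one way or the other, which is the sense in which a Bradley-Terry ranking can be "forced".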
Highlights & Insights
- Complementary design of three methods: Bradley-Terry (fast total ordering) \(\rightarrow\) UFG depth (preserving incomparability) \(\rightarrow\) Q*Text (cardinal evaluation) forms a complete evaluation toolkit. Different methods are chosen for different scenarios—practitioners can use Bradley-Terry to quickly select strategies, while researchers can use Q*Text to quantify differences.
- Gaussian penalty design of Q*Text: Using \(\exp(-\alpha(x-\mu)^2)\) to penalize values far from the target keeps degenerate outputs from being rewarded. Unlike a plain weighted sum, it automatically flags degenerate generation: repetitive or gibberish outputs score close to 0 (0.02 in the case analysis). Because the penalty peaks at \(\mu\), metric values near the target are favored over extremes, which is how Q*Text handles the multi-metric trade-off.
- Discovery that "top methods are actually incomparable": This is an important caution for the NLP community's benchmark culture, which often pursues a "single winner"; which method is superior frequently depends on which metric is prioritized.
- Decoding strategy can matter more than model size: GPT2-XL (1.5B) + CS(0.6,5) achieves a near-human Q*Text of 86.69, whereas Llama3-8B + beam search scores 0.02, a gap of more than three orders of magnitude. The practical guidance for deployment: tune the decoding strategy before switching to a larger model.
- Experimental scale of 1.8 million generated texts: Covering 6 models \(\times\) 3 datasets \(\times\) 59 configurations, this is one of the largest-scale studies on decoding strategy evaluation to date.
Limitations & Future Work
- Only three automatic metrics: Excludes MAUVE (which requires aggregated data) and does not consider dimensions like factuality, safety, or fluency.
- UFG depth computational bottleneck: Its worst-case \(O(2^m)\) complexity limits its application to small subsets of methods (only 4 methods were compared in this paper).
- Maximum length of 256 tokens: Does not evaluate long-text generation scenarios, where coherence measurement is more complex.
- Model scope: The largest model evaluated is 11B; does not include 70B+ or GPT-4 level models.
- Generalizability of Q*Text parameters: The Gaussian penalty parameters \(\mu_i\) and \(\alpha_i\) depend on human ratings from the training data, requiring re-annotation and optimization for cross-domain applications.
- Instruction-following scenarios not considered: Only evaluates text continuation tasks; multi-criteria evaluation in chat scenarios may require different combinations of metrics (such as helpfulness, safety).
- Incomplete coverage of decoding strategies: Does not include recently popular strategies like speculative decoding or guided generation.
Related Work & Insights
- vs MAUVE (Pillutla et al., 2021): MAUVE is a distribution-level metric, whereas this paper requires instance-level metrics to construct partial orders; the two are complementary.
- vs Chatbot Arena (Chiang et al., 2024): Arena also uses the Bradley-Terry model to rank LLMs, while this paper extends it to decoding strategies, multi-criteria, and partial ordering.
- vs Contrastive Search (Su et al., 2022): CS performs best in this evaluation, but the optimal parameters vary across models/tasks.
- Insights: The design concept of the Q*Text penalty function can be transferred to other multi-metric evaluation scenarios (e.g., balancing faithfulness, relevance, and fluency in RAG evaluation).
Rating
- Novelty: ⭐⭐⭐⭐ Introducing partial order theory/depth functions to text generation evaluation is novel, and the Q*Text design is practical.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely large scale: 6 models \(\times\) 59 configurations \(\times\) 3 datasets, yielding 1.8M+ generations.
- Writing Quality: ⭐⭐⭐⭐ The framework is clear (two scenarios and three methods), though the mathematical section is slightly heavy.
- Value: ⭐⭐⭐⭐ Provides a systematic toolkit for text generation evaluation and offers practical guidance for selecting decoding strategies.