A Multi-Persona Framework for Argument Quality Assessment

Conference: ACL 2025
Code: None
Area: Others
Keywords: Argument Quality Assessment, Multi-Perspective Evaluation, Large Language Models, Coarse-to-Fine Scoring, Persona Generation

TL;DR

This paper proposes the MPAQ framework, which uses Large Language Models to simulate multiple distinct evaluator perspectives (personas) and assess arguments from several angles. It introduces a coarse-to-fine scoring strategy (first an integer, then a decimal) and significantly outperforms existing baselines on the IBM-Rank-30k and IBM-ArgQ-5.3k datasets while providing interpretable multi-perspective explanations.

Background & Motivation

Background: Argument Quality Assessment (AQA) is a core task in computational argumentation, aiming to automatically determine the quality of an argument. This task is inherently subjective—different evaluators may give vastly different scores to the same argument based on their backgrounds, perspectives, and interests. Existing datasets (e.g., IBM-Rank-30k) model this subjectivity by collecting opinions from multiple annotators, but most computational methods neglect the multi-perspective nature of evaluation during modeling.

Limitations of Prior Work: (1) Existing methods typically average the scores of multiple annotators as the training objective, which obliterates the differences between distinct evaluation perspectives—an argument might score high in logic but low in emotional appeal, and this nuance disappears after averaging. (2) Methods based on pre-trained models such as BERT lack interpretability regarding the evaluation process, failing to explain why a certain score was given. (3) Although direct scoring by LLMs offers some reasoning capabilities, a single-perspective evaluation still fails to capture the inherent multi-faceted nature of this task.

Key Challenge: The conflict between the "multi-faceted" nature of argument quality and the "single perspective" of evaluation models. Argument quality is multidimensional (logical rigor, sufficiency of evidence, emotional persuasiveness, clarity of expression, etc.), whereas existing evaluation methods either output a single scalar score or evaluate from only one perspective.

Goal: Design a framework that can simulate multiple different evaluator personas, with each persona assessing argument quality from a different angle, and ultimately aggregating these multi-perspective assessments to obtain a more comprehensive and robust quality score.

Key Insight: Leverage the role-playing capability of LLMs—by using carefully designed persona prompts, a single LLM is guided to "play" evaluators with different backgrounds and assess arguments from their respective professional standpoints. This captures multi-perspective differences while providing interpretable evaluation rationales through the reasoning capabilities of LLMs.

Core Idea: Dynamically generate targeted evaluator personas, steer the LLM to simulate the reasoning process of each persona for multi-perspective evaluation, and then convert qualitative judgments into precise numerical scores using a coarse-to-fine scoring strategy.

Method

Overall Architecture

The MPAQ framework consists of three core stages: (1) Persona Generation—dynamically generating multiple evaluator personas with different backgrounds and concerns based on the topic and content of the input argument; (2) Multi-Perspective Evaluation—guiding the LLM to evaluate the argument respectively from the standpoint of each persona, generating detailed reasoning processes and preliminary judgments; (3) Coarse-to-Fine Scoring—obtaining coarse-grained integer scores from each persona first, then refining them into fine-grained decimal scores, and finally aggregating all persona scores to obtain the final result.
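
The three stages can be sketched end to end as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the prompt wording, the function names, and the choice to obtain the integer and decimal scores in a single scoring call are all assumptions made for the example, and `fake_llm` is a canned stand-in for a real model.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a prompt to the model's reply.
LLM = Callable[[str], str]


@dataclass
class PersonaResult:
    persona: str    # persona description
    reasoning: str  # evaluation written from that persona's standpoint
    coarse: int     # coarse-grained integer score
    fine: float     # fine-grained decimal score


def generate_personas(llm: LLM, topic: str, argument: str, k: int) -> List[str]:
    """Stage 1: ask for k persona descriptions, one per line (assumed format)."""
    prompt = (
        f"Topic: {topic}\nArgument: {argument}\n"
        f"List {k} distinct evaluator personas, one per line."
    )
    lines = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    return lines[:k]


def evaluate(llm: LLM, persona: str, argument: str) -> str:
    """Stage 2: independent reasoning from a single persona's standpoint."""
    return llm(f"You are {persona}. Assess this argument:\n{argument}")


def score(llm: LLM, persona: str, reasoning: str) -> Tuple[int, float]:
    """Stage 3: integer score first, then its decimal refinement."""
    reply = llm(
        f"As {persona}, given your assessment:\n{reasoning}\n"
        "Reply with an integer score (1-5), then a decimal refinement."
    )
    coarse_text, fine_text = reply.split()[:2]
    return int(coarse_text), float(fine_text)


def assess(llm: LLM, topic: str, argument: str, k: int = 5) -> Tuple[float, List[PersonaResult]]:
    """Run all three stages and average the decimal scores over personas."""
    results = []
    for persona in generate_personas(llm, topic, argument, k):
        reasoning = evaluate(llm, persona, argument)
        coarse, fine = score(llm, persona, reasoning)
        results.append(PersonaResult(persona, reasoning, coarse, fine))
    return mean(r.fine for r in results), results


def fake_llm(prompt: str) -> str:
    """Canned replies so the sketch runs without a real model."""
    if prompt.startswith("Topic:"):
        return "a logic professor\na data scientist\na social activist"
    if "integer score" in prompt:
        return "3 3.7"
    return "The premises support the conclusion, but the evidence is thin."
```

Calling `assess(fake_llm, "AI ethics", "AI systems should be audited.", k=3)` returns the aggregated score together with the per-persona reasoning, which is where the interpretability of the framework comes from.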

Key Designs

  1. Dynamic Persona Generation Module:

    • Function: Automatically generate the most relevant and diverse evaluator personas for each argument to be assessed.
    • Mechanism: Upon inputting the topic and text of an argument, the LLM is leveraged to generate \(K\) different persona descriptions. Each persona includes: professional background (e.g., "legal professional," "data scientist," "social activist"), evaluation preferences (e.g., "focuses on logical rigor," "values evidence quality," "attends to emotional appeal"), and potential stance inclinations. During generation, diversity constraints are enforced so that different personas do not overlap in backgrounds and areas of focus. This dynamic generation is more flexible than predefined static personas, adaptively selecting the most relevant evaluation perspectives based on the argument's content.
    • Design Motivation: Static personas cannot adapt to the diversity of argument topics. Arguments regarding AI ethics and those regarding economic policies require completely different evaluation perspectives; dynamic generation ensures the relevance of the selected perspectives.
  2. Multi-Perspective Reasoning Evaluation Module:

    • Function: Guide the LLM to independently perform deep evaluation on the argument under the identity of each persona.
    • Mechanism: For each persona \(p_i\), a prompt containing the persona description and evaluation guidelines is constructed to guide the LLM to generate a detailed evaluation reasoning process. The reasoning process requires the LLM to start from the specific standpoint of the persona to analyze the strengths and weaknesses of the argument. For instance, when evaluating as a "logic professor," the LLM needs to focus on argumentative structure, the relationship between premises and conclusions, potential logical fallacies, etc.; when evaluating as an "emotional psychologist," it should attend to the emotional resonance, empathy triggers, and value appeals of the argument. Each persona's evaluation result contains a chain of reasoning and a preliminary quality judgment.
    • Design Motivation: Independent evaluation avoids mutual interference between different perspectives, ensuring that each persona's assessment is based on its unique viewpoint rather than being influenced by other personas. Detailed reasoning chains provide interpretability.
  3. Coarse-to-Fine Scoring Strategy:

    • Function: Convert qualitative evaluation reasoning into precise numerical scores.
    • Mechanism: This is carried out in two steps. First step (coarse-grained): based on the reasoning results, the LLM is asked to assign an integer score (e.g., 1-5 scale) to the argument, which provides clear decision boundaries (distinguishing "poor" and "medium", "medium" and "good"). Second step (fine-grained): further refine the score to a decimal (e.g., 3.7) within the integer interval, where the LLM needs to consider the relative position of the argument within that interval—how much better or worse it is than other arguments of the same tier. Finally, the decimal scores from all personas are averaged (or aggregated via other methods) to obtain the final score.
    • Design Motivation: Directly prompting LLMs to output precise decimal scores is highly challenging and unstable. By adopting a step-by-step strategy—first performing easy, coarse-grained classification and then making fine adjustments within a smaller range—the accuracy and stability of scoring are significantly improved.
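
The coarse-to-fine step can be illustrated with a small sketch. The summary does not specify how the "integer interval" is defined, so the mapping below (integer \(c\) covers \([c, c+1]\), capped at the scale maximum) is an assumption, as are the function names; only the 1-5 scale and the averaging come from the description above.

```python
from statistics import mean
from typing import Iterable, Tuple

SCALE_MIN, SCALE_MAX = 1, 5  # the 1-5 scale is the example given in the text


def interval_for(coarse: int) -> Tuple[float, float]:
    """Assumed mapping: integer c covers [c, c + 1], capped at the scale maximum."""
    if not SCALE_MIN <= coarse <= SCALE_MAX:
        raise ValueError(f"coarse score {coarse} is outside {SCALE_MIN}-{SCALE_MAX}")
    return float(coarse), float(min(coarse + 1, SCALE_MAX))


def refine(coarse: int, proposed: float) -> float:
    """Keep the model's decimal proposal, clamped to the coarse score's interval."""
    low, high = interval_for(coarse)
    return max(low, min(high, proposed))


def aggregate(scores: Iterable[float]) -> float:
    """Plain average over the personas' decimal scores."""
    return mean(scores)
```

Under these assumptions, `refine(3, 3.7)` keeps 3.7, while a proposal that strays outside the tier, such as `refine(3, 4.6)`, is pulled back to the interval boundary. The clamp is what makes the second step a small adjustment rather than a fresh judgment.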

Loss & Training

MPAQ is a prompt-based reasoning framework, and its core does not involve additional model training. All persona generation, reasoning evaluation, and scoring processes are completed through carefully designed prompt templates under zero-shot or few-shot settings of the LLM. Key parameters of the framework include the number of personas \(K\), the range of scoring granularity, and the weights of the aggregation strategy.
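
As a rough illustration of how these parameters might be grouped, the sketch below defines a small configuration object and a weighted aggregation. `MPAQConfig`, its field names, and its defaults are hypothetical; the default of five personas simply mirrors the best configuration in the ablation study, and uniform weights reduce to plain averaging.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class MPAQConfig:
    """Hypothetical container for the framework's key parameters."""
    num_personas: int = 5                       # K
    score_min: int = 1                          # lower end of the scoring range
    score_max: int = 5                          # upper end of the scoring range
    weights: Optional[Sequence[float]] = None   # None means uniform weights


def weighted_score(scores: Sequence[float], config: MPAQConfig) -> float:
    """Aggregate per-persona scores using the configured weights."""
    if len(scores) != config.num_personas:
        raise ValueError("expected one score per persona")
    weights = list(config.weights) if config.weights is not None else [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("expected one weight per persona")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total
```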

Key Experimental Results

Main Results

| Dataset | Metric | MPAQ (GPT-4) | GPT-4 Direct | BERT-based | SVM | Gain |
|---|---|---|---|---|---|---|
| IBM-Rank-30k | Pearson r | 0.52 | 0.41 | 0.38 | 0.31 | +26.8% vs GPT-4 |
| IBM-Rank-30k | Spearman ρ | 0.49 | 0.39 | 0.36 | 0.29 | +25.6% vs GPT-4 |
| IBM-ArgQ-5.3k | Pearson r | 0.58 | 0.46 | 0.43 | 0.35 | +26.1% vs GPT-4 |
| IBM-ArgQ-5.3k | Spearman ρ | 0.55 | 0.44 | 0.40 | 0.33 | +25.0% vs GPT-4 |
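
For readers unfamiliar with the two metrics reported above, the sketch below computes Pearson r and Spearman ρ between predicted and reference scores in plain Python. It is a generic illustration of the standard definitions, not the paper's evaluation script; the function names are chosen for this example, and ties in the ranking are handled with average ranks.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of the spreads."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

Pearson rewards predictions that are linearly related to the reference scores, while Spearman only checks whether the ordering of arguments is preserved, which is why both are reported.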

Ablation Study

| Configuration | IBM-Rank Pearson | IBM-ArgQ Pearson | Description |
|---|---|---|---|
| Full MPAQ | 0.52 | 0.58 | Multi-persona + Coarse-to-fine |
| Single persona | 0.44 | 0.49 | Lacks multi-perspective |
| Fixed persona (non-dynamic) | 0.47 | 0.52 | Static is inferior to dynamic |
| No coarse-to-fine (direct decimal scoring) | 0.46 | 0.51 | Unstable scoring |
| Coarse-grained only (integer) | 0.48 | 0.53 | Insufficient granularity |
| 3 personas | 0.50 | 0.55 | Less than the optimal number |
| 5 personas (Optimal) | 0.52 | 0.58 | Best configuration |
| 7 personas | 0.51 | 0.57 | More is not necessarily better |

Key Findings

  • Multi-persona evaluation improves substantially over single-persona evaluation (about +18% relative Pearson on both datasets: 0.44 to 0.52 and 0.49 to 0.58), demonstrating that multiple perspectives better capture the multi-faceted nature of argument quality.
  • Dynamically generating personas outperforms fixed personas, suggesting that targeted evaluation perspectives are more effective for arguments with different topics.
  • The coarse-to-fine scoring strategy is a crucial contribution: on IBM-Rank-30k, removing the fine-grained adjustment (integer-only scoring) drops Pearson by about 4 points (0.52 to 0.48), and removing the coarse-grained step (direct decimal scoring) drops it by about 6 points (0.52 to 0.46), indicating that both steps of the phased strategy are indispensable.
  • 5 personas is the optimal number; 3 is insufficiently diverse, whereas 7 introduces redundancy. This optimal number remains consistent across both datasets.
  • The evaluation rationales generated by MPAQ were rated highly in human evaluations, indicating that the framework not only scores accurately but also provides valuable interpretability.
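
The arithmetic behind these findings can be checked directly against the ablation numbers. The sketch below copies the Pearson values from the ablation study and computes the drop in correlation points and the relative gain of the full framework; the helper names are chosen for this example.

```python
from typing import Dict

# Pearson values copied from the ablation study.
FULL = {"IBM-Rank": 0.52, "IBM-ArgQ": 0.58}
ABLATIONS = {
    "Single persona": {"IBM-Rank": 0.44, "IBM-ArgQ": 0.49},
    "Fixed persona (non-dynamic)": {"IBM-Rank": 0.47, "IBM-ArgQ": 0.52},
    "No coarse-to-fine (direct decimal)": {"IBM-Rank": 0.46, "IBM-ArgQ": 0.51},
    "Coarse-grained only (integer)": {"IBM-Rank": 0.48, "IBM-ArgQ": 0.53},
}


def drop_in_points(full: float, ablated: float) -> int:
    """Absolute drop, in hundredths of a correlation point."""
    return round((full - ablated) * 100)


def relative_gain_pct(full: float, ablated: float) -> float:
    """Relative improvement of the full model over the ablated one, in percent."""
    return round((full - ablated) / ablated * 100, 1)


def summarize() -> Dict[str, Dict[str, int]]:
    """Point drops for every ablation on every dataset."""
    return {
        name: {ds: drop_in_points(FULL[ds], value) for ds, value in scores.items()}
        for name, scores in ABLATIONS.items()
    }
```

Note that the 4-point and 6-point drops quoted for the scoring ablations hold on IBM-Rank; on IBM-ArgQ the same configurations lose 5 and 7 points respectively.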

Highlights & Insights

  • Systematically applying the role-playing capabilities of LLMs to evaluation tasks is a novel and practical approach. By having the LLM "put itself in others' shoes" and simulate different evaluators, the framework addresses the single-perspective limitation of subjective evaluation tasks. The same methodology can be transferred to other subjective evaluation tasks, such as paper peer review, creativity assessment, and product evaluation.
  • Although the coarse-to-fine scoring strategy appears straightforward, it effectively mitigates the instability of LLMs directly outputting precise numerical values. This technique is valuable for any scenario that requires numerical scoring by LLMs.
  • Dynamic persona generation is more flexible than predefining them, but it also raises an interesting question—how to ensure that the generated personas are sufficiently diverse and cover all key evaluation dimensions of the arguments.

Limitations & Future Work

  • Relying entirely on LLM reasoning without any specialized training tailored for argument quality assessment may lead to suboptimal performance on arguments requiring highly specialized domain expertise.
  • The approach of independent evaluation and subsequent aggregation across multiple personas might ignore interactions and debates between them. In real review processes, discussions among evaluators often reveal new evaluation perspectives.
  • The computational cost is relatively high—each argument requires invoking the LLM multiple times (generating personas + \(K\) evaluations + \(K\) scorings), leading to significant API costs in bulk evaluation.
  • Currently evaluated only on English datasets; cross-lingual argument quality assessment remains an important direction for extension.
  • The current aggregation strategy uses simple averaging; exploring weighted aggregation based on persona reliability or expertise could further improve performance.

Related Work & Insights

  • vs LLM-as-Judge (Direct Evaluation): A single LLM evaluation starts from only one perspective, whereas MPAQ simulates the collective intelligence of multiple reviewers through multiple personas, aligning more closely with real-world multi-reviewer scenarios. This is analogous to expanding a peer review panel from 1 reviewer to 5 reviewers.
  • vs BERT-based Methods: BERT-like models require training on large annotated datasets and lack interpretability; MPAQ achieves superior performance in zero-shot/few-shot settings while natively providing reasoning explanations.
  • vs Multi-scoring-based Methods (Self-Consistency): Self-consistency reduces randomness by sampling the same model multiple times, but essentially remains repetitive sampling from the same perspective; MPAQ provides genuine multi-perspective variations through different personas.
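
The cost limitation noted earlier (one persona-generation call plus \(K\) evaluations and \(K\) scorings per argument) translates into a simple call count. The sketch below is back-of-the-envelope arithmetic only; the function names are chosen for this example, and the batch size in the usage note is hypothetical rather than taken from either dataset.

```python
def llm_calls_per_argument(k: int) -> int:
    """One persona-generation call, plus k evaluations and k scorings."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1 + k + k


def total_calls(num_arguments: int, k: int) -> int:
    """LLM calls needed to assess a batch of arguments."""
    return num_arguments * llm_calls_per_argument(k)
```

With the best-performing setting of 5 personas, each argument needs 11 calls, so a hypothetical batch of 1,000 arguments needs 11,000 calls, compared with 1,000 for direct single-call scoring.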

Rating

  • Novelty: ⭐⭐⭐⭐ The framework design of multi-persona simulated evaluation is novel, and the coarse-to-fine scoring strategy is practical.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on two standard datasets, detailed ablation studies, and complete sensitivity analysis of persona numbers.
  • Writing Quality: ⭐⭐⭐⭐ Clear description of methods and intuitive framework illustrations.
  • Value: ⭐⭐⭐⭐ Provides a generalized multi-perspective evaluation framework for subjective assessment NLP tasks, with broad application prospects.