
Reasoning Boosts Opinion Alignment in LLMs

Conference: ICLR 2026 | arXiv: 2603.01214 | Code: GitHub | Area: Reinforcement Learning | Keywords: opinion alignment, GRPO, political reasoning, survey data, digital democracy

TL;DR

The paper applies GRPO-based reinforcement learning to train LLMs to align with individual political opinions via structured reasoning. SFT+GRPO consistently outperforms ICL and ORPO baselines across U.S., German, and Swiss survey datasets, while revealing a systematic left–right ideological asymmetry and a persistent difficulty in predicting Neutral stances.

Background & Motivation

Background: Modeling political opinions holds significant value for digital democracy. LLMs have been widely used to simulate group-level political preferences, primarily through demographic prompts (e.g., "You are a Democrat"), but this approach suffers from major shortcomings in representativeness, controllability, and consistency.

Limitations of Prior Work: (1) Demographic prompts fail to capture individual-level preferences, as intra-group variance is substantial; (2) interview transcript–based methods (Park et al., 2024) are accurate, but the transcripts are prohibitively expensive to collect; (3) political survey data (ANES/VAA) is abundant yet provides only stance labels without reasoning chains, leaving models to learn the reasoning process on their own.

Key Challenge: The statistical nature of LLMs and their limited causal understanding stand in tension with the requirement to faithfully represent diverse political opinions.

Goal: Can RL training enable LLMs to learn a "reason-then-answer" strategy that improves individual-level political opinion alignment?

Key Insight: Opinion formation is framed as a reasoning problem—drawing on GRPO's success in mathematical reasoning and transferring it to political reasoning scenarios.

Core Idea: Political survey data + GRPO rewarding correct stances + SFT warm-start for reasoning format = reasoning-based individual opinion alignment.

Method

Overall Architecture

Two-stage training: SFT → GRPO. A separate model is trained for each individual (voter, party, or candidate). No explicit persona representation is used; only a country label is provided in the system prompt, with preference alignment achieved implicitly through correct answer prediction.

Key Designs

  1. Structured Reasoning Output Format
     • Function: Forces the model to reason before answering, using the format <reasoning>[reasoning text]</reasoning><answer>[stance]</answer> (made concrete in the sketch after this list)
     • Mechanism: Training data contains only stance labels without reasoning chains; the model must learn to generate reasoning under the reward signal, with reasoning quality optimized indirectly through answer accuracy
     • Design Motivation: Explicit reasoning chains encourage the model to organize arguments systematically, avoiding ideological bias caused by intuitive pattern matching

  2. Composite Reward Function
     • Function: Evaluates each generation along three dimensions: format correctness, length compliance, and stance correctness
     • Mechanism: \(R = \alpha_1 R_{\text{format}} + \alpha_2 R_{\text{length}} + \alpha_3 R_{\text{correct}}\), where \(R_{\text{format}}\) checks the four XML tags (up to 4 points), \(R_{\text{length}} = -|L - L^*|\) penalizes deviation from a target length, and \(R_{\text{correct}} = \mathbb{1}[y_i = y_i^*]\) awards 1 point for matching the survey answer
     • Design Motivation: The weights \(\alpha_1=0.25, \alpha_2=0.01, \alpha_3=1.0\) ensure correctness carries the highest weight, format is secondary, and length serves only as a minor regularizer

  3. SFT Warm-Start + Synthetic Argumentation Data
     • Function: Llama-70B generates pro/con arguments for each policy question, from which SFT data is constructed to teach the model the reasoning format
     • Mechanism: The SFT stage resolves format compliance up front (reducing the optimization burden of \(R_{\text{format}}\) in GRPO) while providing a reasonable initialization for political reasoning
     • Design Motivation: Direct GRPO training converges slowly (GRPO-only performs significantly worse than SFT+GRPO); SFT warm-starting substantially improves training dynamics
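
To make the three designs concrete, here is a minimal Python sketch. The tag layout, the four-tag format score, and the reward weights come from the paper; the function names, the word-count length measure, the target length of 200, and the prompt wording are assumptions for illustration.

```python
import re

# Expected completion layout: <reasoning>...</reasoning><answer>...</answer>
FORMAT_RE = re.compile(r"<reasoning>(.*?)</reasoning>\s*<answer>(.*?)</answer>", re.DOTALL)
TAGS = ("<reasoning>", "</reasoning>", "<answer>", "</answer>")
A_FORMAT, A_LENGTH, A_CORRECT = 0.25, 0.01, 1.0  # reward weights from the paper

def parse_completion(text: str):
    """Return (reasoning, answer) if the completion is well-formed, else None."""
    m = FORMAT_RE.search(text)
    return (m.group(1).strip(), m.group(2).strip()) if m else None

def reward(completion: str, gold: str, target_len: int = 200) -> float:
    """Composite reward R = a1*R_format + a2*R_length + a3*R_correct."""
    r_format = sum(tag in completion for tag in TAGS)      # one point per tag, up to 4
    r_length = -abs(len(completion.split()) - target_len)  # length in words (assumption)
    parsed = parse_completion(completion)
    r_correct = float(parsed is not None and parsed[1] == gold)
    return A_FORMAT * r_format + A_LENGTH * r_length + A_CORRECT * r_correct

def build_sft_example(question: str, pro: str, con: str, stance: str) -> dict:
    """Assemble one SFT pair from synthetic pro/con arguments and the survey stance."""
    reasoning = pro if stance == "Yes" else con  # Neutral handling simplified here
    return {
        "prompt": f"Question: {question}\nReason first, then answer.",
        "completion": f"<reasoning>{reasoning}</reasoning><answer>{stance}</answer>",
    }
```

During GRPO only reward() is needed; build_sft_example() corresponds to the warm-start stage that teaches the format before RL begins.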

Loss & Training

GRPO (Group Relative Policy Optimization): For each prompt, a group of outputs is sampled, and intra-group reward normalization (subtracting the mean and dividing by the standard deviation) replaces the value function in traditional PPO for advantage estimation. LoRA fine-tuning (\(r=32, \alpha=32\)) with 4-bit quantization. SFT: 800 steps; GRPO: 800 steps; group size: 8; \(\beta=0\); temperature \(T=1.0\).
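
The group-relative advantage that replaces PPO's value function is compact enough to show directly. A sketch assuming NumPy, using the group size of 8 from the training setup; the reward values are illustrative:

```python
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize rewards within one prompt's sampled group (GRPO advantage)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One prompt, group of 8 sampled completions, each scored by the composite reward.
adv = group_advantages(np.array([1.9, 1.65, 0.9, 1.9, 0.65, 1.9, 1.65, 0.9]))
# Above-average completions get positive advantages and are up-weighted in the
# policy-gradient update; below-average ones are down-weighted.
```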

Key Experimental Results

Main Results (Macro-F1 %, averaged over 8 sampling runs at T=1.0; metric sketch below the table)

| Method | smartvote (Switzerland) | WoM (Germany) | ANES (USA) |
| --- | --- | --- | --- |
| SFT+GRPO (Magistral 24B) | 70.73 | 53.21 | 45.43 |
| SFT (Magistral 24B) | 67.63 | 51.86 | 39.15 |
| GRPO only (Magistral 24B) | 60.56 | 51.00 | 43.79 |
| SFT+GRPO (Llama 3.1 8B) | 66.88 | 52.53 | 40.66 |
| ICL (Magistral 24B) | 66.16 | 26.19 | 19.23 |
| ORPO | 23.31 | 24.73 | 24.25 |
| Random | 50.0 | 33.33 | 33.33 |
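
For reference, a hedged sketch of how the reported numbers are computed, assuming scikit-learn's f1_score; the gold labels and per-run predictions below are placeholders, while the averaging over 8 sampling runs at T=1.0 follows the table's setup:

```python
import numpy as np
from sklearn.metrics import f1_score

def mean_macro_f1(gold, runs):
    """Macro-F1 in percent, averaged over independent sampling runs."""
    return 100 * np.mean([f1_score(gold, preds, average="macro") for preds in runs])

gold = ["Yes", "No", "Neutral", "Yes", "No"]
runs = [
    ["Yes", "No", "No", "Yes", "No"],            # run 1 (placeholder predictions)
    ["Yes", "Neutral", "Neutral", "Yes", "No"],  # run 2; the paper uses 8 runs
]
print(f"{mean_macro_f1(gold, runs):.2f}")
```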

Ideological Bias Analysis (F1 by Political Group)

| Political Group | smartvote F1 | WoM F1 | ANES F1 | Notes |
| --- | --- | --- | --- | --- |
| Left | High | High | Relatively high | Easiest for the model to align with |
| Center | Medium | High | Medium | Intermediate performance |
| Right | Low | Medium | Low | Systematically worst |

Key Findings

  • SFT+GRPO is consistently optimal: It outperforms or matches SFT in all 9 model×dataset combinations, with statistical significance under a Welch t-test with Bonferroni correction (see the sketch after this list)
  • Neutral is the hardest class: Neutral recall is lowest on ANES; the Neutral base rate correlates significantly with F1 at \(r=-0.59\); Right-leaning groups answer Neutral most frequently, leading to the greatest performance degradation
  • Reasoning reversal phenomenon: After training, models use similar arguments (e.g., "equal opportunity") to support opposing stances—reasoning content is semantically consistent but framed differently (see Table 1 examples)
  • Answer-flipping experiment: Reversing all smartvote answers before training improves F1 for Right candidates, yet still falls short of the original Left performance, suggesting that Left-leaning preferences may be intrinsically easier to model
  • PCA space shift: Trained agents shift toward the center-right and conservative direction in smartvote PCA space (opposite to the left-liberal bias reported in the literature), reflecting GRPO alignment rather than base model bias
  • Effect of SFT data bias: Progressively biased SFT data severely harms Right candidates without necessarily benefiting Left candidates, indicating that bias primarily damages the disadvantaged group
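
A minimal sketch of the significance test behind the first finding, using SciPy; the per-run F1 arrays are placeholders, and the correction factor of 9 matches the nine model×dataset comparisons:

```python
from scipy.stats import ttest_ind

# Placeholder per-run F1 scores (8 runs each) for one model x dataset pair.
sft_grpo = [70.1, 71.0, 70.5, 70.9, 70.3, 71.2, 70.6, 70.8]
sft_only = [67.2, 67.9, 67.5, 67.8, 67.4, 68.0, 67.6, 67.7]

t, p = ttest_ind(sft_grpo, sft_only, equal_var=False)  # Welch's t-test
p_bonferroni = min(1.0, p * 9)  # correct for 9 simultaneous comparisons
print(f"t = {t:.2f}, Bonferroni-corrected p = {p_bonferroni:.4f}")
```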

Highlights & Insights

  • Reframing political opinion alignment as a reasoning problem: Rather than relying on demographic proxies, the method enables models to "understand" each individual's stance through an explicit reasoning process—a conceptual paradigm shift
  • Validation across three countries and three political systems: smartvote (binary Yes/No), WoM (ternary + multi-election aggregation), and ANES (heterogeneous question formats requiring recoding)—demonstrating strong methodological generalizability
  • Deep insight into ideological asymmetry: Right-leaning preferences are systematically harder to learn, possibly due to imbalanced pretraining corpora or more complex statistical structure inherent to right-leaning positions
  • Asymmetric effect of SFT data bias: Bias harms the disadvantaged group more than it benefits the advantaged group—a cautionary finding for trustworthy AI system design

Limitations & Future Work

  • A separate model must be trained per individual, making computational cost \(O(N)\) and thus unscalable; future work should explore persona-conditioned single-model architectures
  • Test sets are very small (12–30 questions), limiting statistical confidence
  • The ternary classification {Yes, Neutral, No} discards fine-grained information from the original Likert scale
  • The choice of ANES recoding scheme (conservative vs. aggressive) affects results, indicating sensitivity to data preprocessing
  • The best F1 of ~70% leaves a substantial gap from a "faithful digital twin"
  • Zero-shot generalization from limited survey data to entirely new policy issues remains unexplored
Comparison with Related Work

  • vs. Santurkar et al. (2023) demographic prompting: Their work reveals that LLMs' default opinion distributions do not represent real populations; this paper bypasses demographic proxies entirely by aligning individuals directly from survey data
  • vs. Park et al. (2024) interview transcript modeling: Their approach achieves high accuracy by building individual personas from rich interview text, but data acquisition costs are prohibitive; this paper uses structured survey data as a lightweight alternative
  • vs. DeepSeek-R1 (2025) GRPO for mathematical reasoning: GRPO's success in mathematical reasoning motivates its application here; this paper demonstrates its effectiveness in political reasoning, albeit with smaller gains than in the mathematical domain

Rating

  • Novelty: ⭐⭐⭐⭐ — Applying GRPO to political reasoning is a novel contribution; the ideological bias analysis is insightful
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 3 models × 3 datasets, ideological analysis, answer-flipping experiments, and SFT bias experiments
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured; PCA visualizations and reasoning examples are persuasive
  • Value: ⭐⭐⭐ — Direction is promising but scalability is questionable; the finding on the difficulty of learning Right-leaning preferences carries social significance