Reasoning Boosts Opinion Alignment in LLMs

Conference: ICLR 2026
arXiv: 2603.01214
Code: GitHub
Area: Reinforcement Learning
Keywords: opinion alignment, GRPO, political reasoning, survey data, digital democracy

TL;DR

GRPO-based reinforcement learning is used to train LLMs to align with individual political opinions via structured reasoning. SFT+GRPO consistently outperforms ICL and ORPO baselines across U.S., German, and Swiss datasets. The analysis also reveals a systematic left–right ideological asymmetry and a persistent difficulty in predicting Neutral stances.

Background & Motivation

Background: Modeling political opinions is valuable for digital democracy. LLMs have been widely used to simulate group-level political preferences, primarily through demographic prompts (e.g., "You are a Democrat"), but this approach falls short on three counts: representativeness, controllability, and consistency.

Limitations of Prior Work: (1) Demographic prompts fail to capture individual-level preferences, as intra-group variance is substantial; (2) interview transcript–based methods (Park et al., 2024) are accurate but prohibitively expensive to collect; (3) political survey data (ANES/VAA) is abundant yet provides only stance labels without reasoning chains, requiring models to learn the reasoning process on their own.

Key Challenge: The statistical nature of LLMs and their limited causal understanding stand in tension with the requirement to faithfully represent diverse political opinions.

Goal: Can RL training enable LLMs to learn a "reason-then-answer" strategy that improves individual-level political opinion alignment?

Key Insight: Opinion formation is framed as a reasoning problem—drawing on GRPO's success in mathematical reasoning and transferring it to political reasoning scenarios.

Core Idea: Political survey data + GRPO rewarding correct stances + SFT warm-start for reasoning format = reasoning-based individual opinion alignment.

Method

Overall Architecture

Two-stage training: SFT → GRPO. A separate model is trained for each individual (voter, party, or candidate). No explicit persona representation is used; only a country label is provided in the system prompt, with preference alignment achieved implicitly through correct answer prediction.

Key Designs

Structured Reasoning Output Format
Function: Forces the model to reason before answering, using the format <reasoning>[reasoning text]</reasoning><answer>[stance]</answer>
Mechanism: Training data contains only stance labels without reasoning chains; the model must learn to generate reasoning from the reward signal alone, so reasoning quality is optimized only indirectly, through answer accuracy
Design Motivation: Explicit reasoning chains encourage the model to organize arguments systematically, avoiding ideological bias caused by intuitive pattern matching
Composite Reward Function
Function: Evaluates each generation along three dimensions: format correctness, length compliance, and stance correctness
Mechanism: \(R = \alpha_1 R_{\text{format}} + \alpha_2 R_{\text{length}} + \alpha_3 R_{\text{correct}}\), where \(R_{\text{format}}\) checks four XML tags (up to 4 points), \(R_{\text{length}} = -|L - L^*|\) penalizes deviation from target length, and \(R_{\text{correct}} = \mathbb{1}[y_i = y_i^*]\) awards 1 point for matching the survey answer
Design Motivation: \(\alpha_1=0.25, \alpha_2=0.01, \alpha_3=1.0\)—correctness carries the highest weight, format is secondary, and length serves only as a minor regularizer
SFT Warm-Start + Synthetic Argumentation Data
Function: Llama-70B is used to generate pro/con arguments for each policy question, constructing SFT data to teach the model the reasoning format
Mechanism: The SFT stage resolves format compliance issues (reducing the optimization burden of \(R_{\text{format}}\) in GRPO) while providing a reasonable initialization for political reasoning
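Design Motivation: Direct GRPO training converges slowly. In the main results, GRPO-only trails SFT+GRPO on all three datasets with Magistral 24B (60.56 vs. 70.73 on smartvote, 51.00 vs. 53.21 on WoM, 43.79 vs. 45.43 on ANES); SFT warm-starting substantially improves training dynamics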
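
The composite reward described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weights come from the summary, but the per-tag scoring rule (one point for each tag that appears exactly once), the word-based length measure, and the case-insensitive answer comparison are assumptions, as are all function names.

```python
import re

# Weights as reported in the summary above.
ALPHA_FORMAT, ALPHA_LENGTH, ALPHA_CORRECT = 0.25, 0.01, 1.0

# The four tags of the output format; one point each (assumed scoring rule).
TAGS = ("<reasoning>", "</reasoning>", "<answer>", "</answer>")
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(text: str) -> float:
    """Count the tags that occur exactly once, giving a score from 0 to 4."""
    return float(sum(text.count(tag) == 1 for tag in TAGS))


def length_reward(text: str, target_len: int) -> float:
    """R_length = -|L - L*|, with L counted in whitespace-separated words (assumption)."""
    return -float(abs(len(text.split()) - target_len))


def correctness_reward(text: str, gold: str) -> float:
    """1.0 if the extracted answer matches the survey label, else 0.0."""
    match = ANSWER_RE.search(text)
    if match is None:
        return 0.0
    return float(match.group(1).strip().lower() == gold.strip().lower())


def composite_reward(text: str, gold: str, target_len: int) -> float:
    """Weighted sum of the format, length, and correctness terms."""
    return (
        ALPHA_FORMAT * format_reward(text)
        + ALPHA_LENGTH * length_reward(text, target_len)
        + ALPHA_CORRECT * correctness_reward(text, gold)
    )


sample = "<reasoning>a b c</reasoning><answer>Yes</answer>"
print(composite_reward(sample, "Yes", target_len=3))
```

Under these assumptions, a well-formed completion of the target length with the correct stance scores 0.25 × 4 + 0 + 1 = 2.0, so the format and correctness terms contribute equally at their maxima even though correctness has the larger weight.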

Loss & Training

GRPO (Group Relative Policy Optimization): For each prompt, a group of outputs is sampled, and rewards are normalized within the group (subtracting the group mean and dividing by the group standard deviation) to estimate advantages, replacing the learned value function of traditional PPO. Training uses LoRA fine-tuning (\(r=32, \alpha=32\)) with 4-bit quantization. Hyperparameters: 800 SFT steps, 800 GRPO steps, group size 8, \(\beta=0\), and sampling temperature \(T=1.0\).
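
The group-relative advantage estimate can be sketched in a few lines. This is an illustrative sketch only: the use of the population standard deviation and the small `eps` stabilizer are assumptions, and the reward values are made up.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    `eps` (an assumption) avoids division by zero when every sample in the
    group receives the same reward.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; the sample std is also plausible
    return [(r - mean) / (std + eps) for r in rewards]


# One group of 8 sampled completions for a single prompt (made-up rewards).
group = [2.0, 1.0, 1.0, 2.0, 0.0, 2.0, 1.0, 1.0]
for reward, advantage in zip(group, group_relative_advantages(group)):
    print(f"reward={reward:.1f}  advantage={advantage:+.3f}")
```

Completions that score above the group mean receive positive advantages and are reinforced; if all eight completions score the same, every advantage is zero and the group contributes no gradient signal.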

Key Experimental Results

Main Results (Macro-F1 %, 8 runs, T=1.0)

Method	smartvote (Switzerland)	WoM (Germany)	ANES (USA)
SFT+GRPO (Magistral 24B)	70.73	53.21	45.43
SFT (Magistral 24B)	67.63	51.86	39.15
GRPO only (Magistral 24B)	60.56	51.00	43.79
SFT+GRPO (Llama 3.1 8B)	66.88	52.53	40.66
ICL (Magistral 24B)	66.16	26.19	19.23
ORPO	23.31	24.73	24.25
Random	50.0	33.33	33.33
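
For reference, the metric in the table can be computed as below. This is a minimal sketch assuming the standard definition of macro-F1 (the unweighted mean of per-class F1) and the convention that a class with no true or predicted instances scores 0; the label lists are made up for illustration and are not from the paper.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores, returned as a fraction in [0, 1]."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["Yes", "Yes", "No", "Neutral"]
pred = ["Yes", "No", "No", "No"]
print(round(100 * macro_f1(gold, pred, ["Yes", "Neutral", "No"]), 2))
```

Because every class counts equally, a model that never predicts Neutral gets an F1 of 0 for that class regardless of how rare it is, which is one reason the Neutral stance weighs heavily on the scores for the ternary datasets.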

Ablation Study: Ideological Bias Analysis

Political Group	smartvote F1	WoM F1	ANES F1	Notes
Left	High	High	Relatively high	Easiest for the model to align
Center	Medium	High	Medium	Intermediate performance
Right	Low	Medium	Low	Systematically worst

Key Findings

SFT+GRPO is consistently the best configuration: It matches or outperforms SFT in all 9 model×dataset combinations, with significance assessed by Welch's t-test under Bonferroni correction
Neutral is the hardest class: Neutral recall is lowest on ANES; the Neutral base rate is significantly negatively correlated with F1 (\(r=-0.59\)); Right-leaning groups answer Neutral most frequently, which leads to the greatest performance degradation
Reasoning reversal phenomenon: After training, models use similar arguments (e.g., "equal opportunity") to support opposing stances—reasoning content is semantically consistent but framed differently (see Table 1 examples)
Answer-flipping experiment: Reversing all smartvote answers before training improves F1 for Right candidates, yet still falls short of the original Left performance, suggesting that Left-leaning preferences may be intrinsically easier to model
PCA space shift: Trained agents shift toward the center-right and conservative direction in smartvote PCA space (opposite to the left-liberal bias reported in the literature), reflecting GRPO alignment rather than base model bias
Effect of SFT data bias: Progressively biased SFT data severely harms Right candidates without necessarily benefiting Left candidates, indicating that bias primarily damages the disadvantaged group
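
The significance testing mentioned in the first finding combines Welch's t-test with a Bonferroni correction. A minimal sketch of both pieces follows, using only the standard library; the per-run scores are made-up placeholders, and the choice of 9 comparisons (one per model×dataset combination) is an assumption about how the correction was applied.

```python
from statistics import fmean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, the two samples may have unequal variances.
    Both samples need at least two values and non-zero total variance.
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)
    t = (fmean(a) - fmean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# Hypothetical macro-F1 scores from repeated runs of two methods.
method_a = [0.70, 0.72, 0.71, 0.73]
method_b = [0.66, 0.68, 0.67, 0.69]
t_stat, dof = welch_t(method_a, method_b)
print(f"t={t_stat:.2f}, df={dof:.1f}, per-test alpha={bonferroni_alpha(0.05, 9):.4f}")
```

Turning the statistic into a p-value requires the t-distribution CDF, which the standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the full Welch test if SciPy is available.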

Highlights & Insights

Reframing political opinion alignment as a reasoning problem: Rather than relying on demographic proxies, the method enables models to "understand" each individual's stance through an explicit reasoning process—a conceptual paradigm shift
Validation across three countries and three political systems: smartvote (binary Yes/No), WoM (ternary + multi-election aggregation), and ANES (heterogeneous question formats requiring recoding)—demonstrating strong methodological generalizability
Deep insight into ideological asymmetry: Right-leaning preferences are systematically harder to learn, possibly due to imbalanced pretraining corpora or more complex statistical structure inherent to right-leaning positions
Asymmetric effect of SFT data bias: Bias harms the disadvantaged group more than it benefits the advantaged group—a cautionary finding for trustworthy AI system design

Limitations & Future Work

A separate model must be trained per individual, so computational cost grows linearly, \(O(N)\), in the number of individuals, which limits scalability; future work should explore persona-conditioned single-model architectures
Test sets are very small (12–30 questions), limiting statistical confidence
The ternary classification {Yes, Neutral, No} discards fine-grained information from the original Likert scale
The choice of ANES recoding scheme (conservative vs. aggressive) affects results, indicating sensitivity to data preprocessing
The best macro-F1 (about 70.7% on smartvote) remains far from a "faithful digital twin"
Zero-shot generalization from limited survey data to entirely new policy issues remains unexplored

Related Work

vs. Santurkar et al. (2023) demographic prompting: Their work reveals that LLMs' default opinion distributions do not represent real populations; this paper bypasses demographic proxies entirely by aligning individuals directly from survey data
vs. Park et al. (2024) interview transcript modeling: Their approach achieves high accuracy using rich text to construct individual personas, but data acquisition costs are prohibitive; this paper uses structured survey data as a lightweight alternative
vs. DeepSeek-R1 (2025) GRPO for mathematical reasoning: GRPO's success in mathematical reasoning motivates its application here; this paper demonstrates its effectiveness in political reasoning, albeit with less pronounced gains than in the mathematical domain

Rating

Novelty: ⭐⭐⭐⭐ — Applying GRPO to political reasoning is a novel contribution; the ideological bias analysis is insightful
Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 3 models × 3 datasets, ideological analysis, answer-flipping experiments, and SFT bias experiments
Writing Quality: ⭐⭐⭐⭐ — Well-structured; PCA visualizations and reasoning examples are persuasive
Value: ⭐⭐⭐ — Direction is promising but scalability is questionable; the finding on the difficulty of learning Right-leaning preferences carries social significance