Swap-guided Preference Learning for Personalized RLHF (SPL)

Conference: ICLR 2026
arXiv: 2603.12595
Code: https://github.com/cobang0111/SPL
Area: LLM Alignment / Personalized Alignment
Keywords: personalized reward model, posterior collapse, latent variable preference learning, swap-guided regularization, preference diversity

TL;DR

This paper addresses posterior collapse in Variational Preference Learning (VPL) by proposing SPL, which introduces swap-guided base regularization (forcing latent variables to encode user preferences rather than being ignored), a Preferential-IAF decomposition of swap-reversible and swap-invariant signals, and adaptive latent variable modulation. On Llama-3.1-8B, SPL achieves 63.71% accuracy and 97.10% active units, whereas VPL collapses to 57.14% accuracy and 0% active units.

Background & Motivation

Background: Unified reward models assume consistent preferences across all users, yet real-world user preferences exhibit significant diversity. VPL models user-specific preferences via latent variables.

Limitations of Prior Work: Under sparse data combined with a strong decoder, VPL's latent variables suffer from posterior collapse — the latent variable is entirely ignored and the model degenerates into a single reward model.

Core Idea: Swap-guided regularization forces the latent variables to remain informative, while IAF decomposes user-specific signals.

Method

Overall Architecture

SPL extends the Variational Preference Learning (VPL) framework: user preference data \(\mathbb{D}_h\) → encoder maps to latent variable \(z_0\) → P-IAF transforms it into a richer \(z_K\) → reward decoder outputs personalized reward \(r_\phi(x,y,z_K)\). The key innovation is leveraging swaps (exchanging chosen/rejected order) to construct virtual opposing users that guide the encoding process.
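To make the pipeline concrete, here is a minimal PyTorch sketch of the forward pass as described above. The module interfaces (`encoder`, `piaf`, `decoder`) and the returned tuple are illustrative assumptions, not the paper's actual API; see the linked repository for the real implementation.

```python
# Minimal sketch of the SPL forward pass (PyTorch). The sub-module interfaces
# are assumptions based on the architecture description above.
import torch
import torch.nn as nn

class SPLRewardModel(nn.Module):
    def __init__(self, encoder: nn.Module, piaf: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # preference data D_h -> base posterior (mu, log_var)
        self.piaf = piaf        # Preferential-IAF: z_0 -> z_K, plus log|det Jacobian|
        self.decoder = decoder  # personalized reward r_phi(x, y, z_K)

    def forward(self, user_prefs, prompt, response):
        mu, log_var = self.encoder(user_prefs)
        eps = torch.randn_like(mu)
        z0 = mu + torch.exp(0.5 * log_var) * eps         # reparameterized sample of z_0
        zK, log_det = self.piaf(z0, context=user_prefs)  # richer, possibly multimodal z_K
        reward = self.decoder(prompt, response, zK)      # r_phi(x, y, z_K)
        return reward, (mu, log_var, z0, zK, log_det)
```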

Key Designs

  1. Swap-guided Base Regularization: For each user \(h\), a virtual opposing user \(h_{swap}\) is constructed by swapping every chosen/rejected preference pair. The encoder is constrained to satisfy:

    • Mean sign flip: \(\mu \approx -\mu_{swap}\) (reversed preference direction → reversed latent direction)
    • Log-variance invariance: \(\ell \approx \ell_{swap}\) (uncertainty is unaffected by preference direction)
    • Guidance loss: \(\mathcal{L}_{guide} = \mathbb{E}_h\left[\tfrac{1}{2}(1+\cos(\mu, \mu_{swap})) + \eta \cdot \tfrac{1}{2}(1-\cos(\ell, \ell_{swap}))\right]\)

  2. Preferential-IAF (P-IAF): The IAF context vector is decomposed into a swap-reversible component \(c_d = \frac{1}{2}(c - c_{swap})\) and a swap-invariant component \(c_s = \frac{1}{2}(c + c_{swap})\). \(c_d\) is fed exclusively into the shift function \(\mu_k\) (controlling preference direction), while \(c_s\) is fed exclusively into the scale function \(\sigma_k\) (controlling uncertainty), reducing cross-coupling. After \(K\) transformation steps, a multimodal \(z_K\) is obtained.

  3. Adaptive Latent Variable Modulation: A FiLM-style feature modulation mechanism dynamically adjusts the contribution of \(z_K\) to reward prediction based on signal strength, amplifying strong preference signals and attenuating uncertain ones. (A sketch of all three components follows this list.)
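A compact sketch of the three designs. The cosine form of \(\mathcal{L}_{guide}\) follows the formula in item 1; the context split and the sigmoid gate are our reading of items 2–3, and all function and class names are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def swap_guidance_loss(mu, log_var, mu_swap, log_var_swap, eta=1.0):
    # L_guide: drive cos(mu, mu_swap) -> -1 (mean sign flip) and
    # cos(log_var, log_var_swap) -> +1 (log-variance invariance).
    cos_mu = F.cosine_similarity(mu, mu_swap, dim=-1)
    cos_lv = F.cosine_similarity(log_var, log_var_swap, dim=-1)
    return (0.5 * (1.0 + cos_mu) + eta * 0.5 * (1.0 - cos_lv)).mean()

def decompose_context(c, c_swap):
    # P-IAF split of the IAF context vector.
    c_d = 0.5 * (c - c_swap)  # swap-reversible -> shift mu_k (preference direction)
    c_s = 0.5 * (c + c_swap)  # swap-invariant  -> scale sigma_k (uncertainty)
    return c_d, c_s

class AdaptiveModulation(nn.Module):
    # FiLM-style gating: z_K produces per-feature scale/shift for decoder features,
    # so weak latent signals are attenuated rather than forced into the reward.
    def __init__(self, z_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(z_dim, feat_dim)
        self.to_beta = nn.Linear(z_dim, feat_dim)

    def forward(self, features, zK):
        gamma = torch.sigmoid(self.to_gamma(zK))  # gate in (0, 1)
        beta = self.to_beta(zK)
        return gamma * features + beta
```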

Loss & Training

\(\mathcal{L}(\phi, \psi) = -\text{ELBO} + \lambda \mathcal{L}_{guide}\)

  • \(\text{ELBO}\) = expected preference log-likelihood minus \(\beta \cdot D_{KL}[q_\psi(z_K \mid \mathbb{D}_h) \,\|\, p(z_K)]\)
  • \(D_{KL}\) is computed efficiently via the log-determinant of the IAF Jacobian
  • \(\mathcal{L}_{guide}\) imposes the swap-mirror constraints on the base distribution of \(z_0\)
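Putting the pieces together, a single-sample Monte Carlo sketch of the objective, reusing `swap_guidance_loss` from the block above. The Gaussian and change-of-variables algebra is standard; the coefficient defaults are placeholders, not the paper's settings:

```python
import math
import torch

def spl_training_loss(pref_loglik, mu, log_var, z0, zK, log_det_jac,
                      mu_swap, log_var_swap, beta=0.1, lam=1.0, eta=1.0):
    # L(phi, psi) = -ELBO + lambda * L_guide, with the KL term estimated via
    # the flow change of variables: log q(z_K) = log q(z_0) - log|det J|.
    log_q_z0 = (-0.5 * (log_var + math.log(2 * math.pi)
                        + (z0 - mu) ** 2 / log_var.exp())).sum(dim=-1)
    log_q_zK = log_q_z0 - log_det_jac                                  # change of variables
    log_p_zK = (-0.5 * (zK ** 2 + math.log(2 * math.pi))).sum(dim=-1)  # N(0, I) prior
    kl = (log_q_zK - log_p_zK).mean()          # 1-sample MC estimate of D_KL
    elbo = pref_loglik.mean() - beta * kl
    guide = swap_guidance_loss(mu, log_var, mu_swap, log_var_swap, eta)
    return -elbo + lam * guide
```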

Key Experimental Results

Posterior Collapse Diagnosis

| Model        | Method | Accuracy | Active Units ↑ | Collapse Status     |
|--------------|--------|----------|----------------|---------------------|
| Llama-3.2-3B | VPL    | 62.37%   | 88.22%         | Mild collapse       |
| Llama-3.2-3B | SPL    | 63.28%   | 93.07%         | Healthy             |
| Llama-3.1-8B | VPL    | 57.14%   | 0.00%          | Complete collapse!  |
| Llama-3.1-8B | SPL    | 63.71%   | 97.10%         | Healthy             |
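For reference, "active units" in the VAE literature (Burda et al., 2016) counts latent dimensions whose posterior mean varies non-trivially across inputs. Assuming the paper uses this standard diagnostic (the exact threshold may differ), a minimal computation looks like:

```python
import torch

def active_units_pct(posterior_means: torch.Tensor, threshold: float = 0.01) -> float:
    # posterior_means: (num_users, z_dim), one encoder mean E[z_0 | D_h] per user.
    # A dimension counts as "active" if its mean varies across users by more than
    # `threshold` (0.01 is the conventional cutoff from Burda et al., 2016).
    var_per_dim = posterior_means.var(dim=0)
    return 100.0 * (var_per_dim > threshold).float().mean().item()
```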

Cross-dataset Performance

| Dataset          | VPL KL Stability  | SPL KL Stability | SPL Accuracy Gain |
|------------------|-------------------|------------------|-------------------|
| Pets (easy)      | No collapse       | No collapse      | +0.5%             |
| UF-P-2 (medium)  | Partial collapse  | No collapse      | +2.1%             |
| UF-P-4 (complex) | Complete collapse | No collapse      | +6.6%             |

Key Findings

  • VPL collapses completely on the 8B model: active units fall from 88.22% (3B) to 0.00% (8B), as the stronger decoder bypasses the latent variable entirely
  • SPL eliminates collapse: active units remain at 97.10% even on the 8B model with complex data
  • Collapse correlates with model capacity: larger decoders are more prone to bypassing \(z\), making swap guidance more necessary
  • SPL is robust to \(\beta\): VPL is highly sensitive to the KL weight \(\beta\), while SPL remains stable over a wide range
  • P-IAF decomposition is effective: ablations show that removing the \(c_d/c_s\) decomposition reduces user specialization

Highlights & Insights

  • First report of posterior collapse in preference learning: while well-known in the VAE literature, this phenomenon had not been identified in preference modeling
  • Swap guidance is an elegant solution to posterior collapse — it leverages the natural symmetry of preference pairs (swapping chosen/rejected) to constrain the latent space structure
  • The swap-reversible/invariant decomposition in P-IAF carries clear physical meaning in gradient space — directional preference signals are decoupled from uncertainty signals
  • Adaptive modulation avoids overfitting caused by "forcing \(z\) to be used" — automatically falling back when the \(z\) signal is weak

Limitations & Future Work

  • Evaluation scenarios are limited; the personalization effect within actual RLHF training has not been verified (only preference prediction accuracy is assessed)
  • User preference types are defined using predefined categories (helpfulness, honesty, etc.), with no handling of continuous preference spectra
  • The selection of P-IAF step count \(K\) and \(\lambda\) relies on manual tuning, lacking an adaptive mechanism
  • No direct comparison with other personalized RLHF methods (e.g., multi-reward model aggregation) is provided

Comparison & Positioning

  • vs VPL (Poddar et al., 2024): SPL addresses the core deficiency of VPL (posterior collapse), making personalized reward models viable for larger models
  • vs VAE posterior collapse literature: collapse in preference learning has a distinctive cause — the decoder already obtains sufficient information from (prompt, response) pairs and has no need for \(z\)
  • vs multi-reward aggregation: SPL represents user preferences in a continuous latent space, offering greater flexibility than discrete multi-reward approaches
  • Insight: The central challenge of personalized alignment is not "how to model diverse preferences" but "how to ensure preference information is encoded into the latent variable"

Rating

  • Novelty: ⭐⭐⭐⭐ — swap-guided regularization and the P-IAF decomposition are conceptually novel
  • Experimental Thoroughness: ⭐⭐⭐ — validated across multiple models, but application scenarios are limited
  • Writing Quality: ⭐⭐⭐⭐ — the narrative arc from problem analysis (collapse → swap observation → method design) is clear and coherent
  • Value: ⭐⭐⭐⭐ — provides a practical, theoretically grounded solution for personalized alignment