Swap-guided Preference Learning for Personalized RLHF (SPL)¶
Conference: ICLR 2026
arXiv: 2603.12595
Code: https://github.com/cobang0111/SPL
Area: LLM Alignment / Personalized Alignment
Keywords: personalized reward model, posterior collapse, latent variable preference learning, swap-guided regularization, preference diversity
TL;DR¶
This paper addresses posterior collapse in Variational Preference Learning (VPL) by proposing SPL, which introduces swap-guided base regularization (forcing latent variables to encode user preferences rather than being ignored), a Preferential-IAF (P-IAF) that decomposes the flow context into swap-reversible and swap-invariant signals, and adaptive latent variable modulation. On Llama-3.1-8B, SPL achieves 63.71% accuracy with 97.10% active units, whereas VPL collapses to 57.14% accuracy and 0% active units.
Background & Motivation¶
Background: Unified reward models assume consistent preferences across all users, yet real-world user preferences exhibit significant diversity. VPL models user-specific preferences via latent variables.
Limitations of Prior Work: Under sparse data combined with a strong decoder, VPL's latent variables suffer from posterior collapse — the latent variable is entirely ignored and the model degenerates into a single reward model.
Core Idea: Swap-guided regularization forces the latent variables to remain informative, while the P-IAF context decomposition disentangles directional preference signals from uncertainty signals.
Method¶
Overall Architecture¶
SPL extends the Variational Preference Learning (VPL) framework: user preference data \(\mathbb{D}_h\) → encoder maps to latent variable \(z_0\) → P-IAF transforms it into a richer \(z_K\) → reward decoder outputs personalized reward \(r_\phi(x,y,z_K)\). The key innovation is leveraging swaps (exchanging chosen/rejected order) to construct virtual opposing users that guide the encoding process.
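To make the data flow concrete, here is a minimal PyTorch sketch of that pipeline. All names, signatures, and the use of the swapped dataset as flow context are illustrative assumptions for this note, not the released implementation; `encoder`, `piaf_steps`, and `decoder` are assumed to be user-supplied modules.

```python
import torch

def spl_forward(encoder, piaf_steps, decoder, prefs, prefs_swap, x, y):
    """Sketch of the SPL pipeline: D_h -> z_0 -> z_K -> r_phi(x, y, z_K).

    encoder: features of a user's preference data -> (mu, ell)
    piaf_steps: K flow steps, each (z, c, c_swap) -> (z_next, log_det)
    decoder: (x, y, z_K) -> scalar personalized reward
    """
    mu, ell = encoder(prefs)                              # base posterior q(z_0 | D_h)
    z = mu + torch.exp(0.5 * ell) * torch.randn_like(mu)  # reparameterized sample z_0
    for step in piaf_steps:                               # K P-IAF transformations
        z, _ = step(z, prefs, prefs_swap)                 # contexts drive the c_d / c_s split
    return decoder(x, y, z)                               # personalized reward r_phi(x, y, z_K)
```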
Key Designs¶
- Swap-guided Base Regularization: For each user \(h\), a virtual opposing user \(h_{swap}\) is constructed by swapping every chosen/rejected preference pair. The encoder is constrained to satisfy:
- Mean sign flip: \(\mu \approx -\mu_{swap}\) (reversed preference direction → reversed latent direction)
- Log-variance invariance: \(\ell \approx \ell_{swap}\) (uncertainty is unaffected by preference direction)
- Guidance loss: \(\mathcal{L}_{guide} = \mathbb{E}_h[\frac{1}{2}(1+\cos(\mu, \mu_{swap})) + \eta \frac{1}{2}(1-\cos(\ell, \ell_{swap}))]\)
- Preferential-IAF (P-IAF): The IAF context vector is decomposed into a swap-reversible component \(c_d = \frac{1}{2}(c - c_{swap})\) and a swap-invariant component \(c_s = \frac{1}{2}(c + c_{swap})\). \(c_d\) is fed exclusively into the shift function \(\mu_k\) (controlling preference direction), while \(c_s\) is fed exclusively into the scale function \(\sigma_k\) (controlling uncertainty), reducing cross-coupling. After \(K\) transformation steps, a multimodal \(z_K\) is obtained.
- Adaptive Latent Variable Modulation: A FiLM-style feature modulation mechanism dynamically adjusts the contribution of \(z_K\) to reward prediction based on signal strength, amplifying strong preference signals and attenuating uncertain ones (a combined sketch of all three designs follows this list).
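The sketch below instantiates the three designs under the same naming assumptions as above. The guidance loss follows the formula given for \(\mathcal{L}_{guide}\); the P-IAF step and FiLM gate are plausible readings of the text (the autoregressive masking of a real IAF is omitted for brevity), and all layer shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def guidance_loss(mu, mu_swap, ell, ell_swap, eta=1.0):
    """L_guide: drives cos(mu, mu_swap) toward -1 (mean sign flip) and
    cos(ell, ell_swap) toward +1 (log-variance invariance)."""
    flip = 0.5 * (1.0 + F.cosine_similarity(mu, mu_swap, dim=-1))
    inv = 0.5 * (1.0 - F.cosine_similarity(ell, ell_swap, dim=-1))
    return (flip + eta * inv).mean()

class PIAFStep(nn.Module):
    """One P-IAF step: the swap-reversible context c_d feeds only the
    shift mu_k; the swap-invariant context c_s feeds only the scale sigma_k."""
    def __init__(self, z_dim, c_dim):
        super().__init__()
        self.shift = nn.Linear(z_dim + c_dim, z_dim)      # mu_k(z, c_d)
        self.log_scale = nn.Linear(z_dim + c_dim, z_dim)  # log sigma_k(z, c_s)

    def forward(self, z, c, c_swap):
        c_d = 0.5 * (c - c_swap)   # swap-reversible: flips sign under swap
        c_s = 0.5 * (c + c_swap)   # swap-invariant: unchanged under swap
        mu_k = self.shift(torch.cat([z, c_d], dim=-1))
        log_sigma_k = self.log_scale(torch.cat([z, c_s], dim=-1))
        z_next = torch.exp(log_sigma_k) * z + mu_k
        log_det = log_sigma_k.sum(dim=-1)  # log|det J| of the affine map
        return z_next, log_det

class FiLMModulation(nn.Module):
    """FiLM-style gate: z_K produces a per-feature scale and shift for the
    decoder features, so a weak z signal falls back toward the identity."""
    def __init__(self, z_dim, h_dim):
        super().__init__()
        self.gamma = nn.Linear(z_dim, h_dim)
        self.beta = nn.Linear(z_dim, h_dim)

    def forward(self, h, z_K):
        return (1.0 + self.gamma(z_K)) * h + self.beta(z_K)
```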
Loss & Training¶
- Total loss: \(\mathcal{L}(\phi, \psi) = -\text{ELBO} + \lambda \mathcal{L}_{guide}\)
- ELBO = expected preference likelihood \(- \beta \cdot D_{KL}[q_\psi(z_K|\mathbb{D}_h) \,\|\, p(z_K)]\)
- \(D_{KL}\) is computed efficiently via the Jacobian determinants of the IAF steps
- \(\mathcal{L}_{guide}\) imposes the swap-mirror constraints on the base distribution \(z_0\)
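Assembled in code, one training step would compute something like the sketch below. The Bradley-Terry preference likelihood and the standard-normal prior are standard choices assumed here rather than confirmed by the paper; variable names follow the earlier sketches.

```python
import math
import torch
import torch.nn.functional as F

def spl_loss(r_chosen, r_rejected, z0, z_K, mu, ell, log_dets, l_guide,
             beta=0.1, lam=1.0):
    """L(phi, psi) = -ELBO + lambda * L_guide, with the KL estimated by
    Monte Carlo through the flow:
    log q(z_K) = log q(z_0) - sum_k log|det J_k|."""
    # Expected preference likelihood: Bradley-Terry P(chosen > rejected)
    pref_nll = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Base Gaussian log-density log q(z_0 | D_h); ell is the log-variance
    log_q0 = -0.5 * ((z0 - mu) ** 2 / ell.exp() + ell
                     + math.log(2 * math.pi)).sum(-1)
    log_qK = log_q0 - log_dets                 # change of variables over K steps
    # Standard-normal prior log p(z_K)
    log_p = -0.5 * (z_K ** 2 + math.log(2 * math.pi)).sum(-1)
    kl = (log_qK - log_p).mean()               # D_KL[q(z_K | D_h) || p(z_K)]
    return pref_nll + beta * kl + lam * l_guide
```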
Key Experimental Results¶
Posterior Collapse Diagnosis¶
| Model | Method | Accuracy | Active Units↑ | Collapse Status |
|---|---|---|---|---|
| Llama-3.2-3B | VPL | 62.37% | 88.22% | Mild collapse |
| Llama-3.2-3B | SPL | 63.28% | 93.07% | Healthy |
| Llama-3.1-8B | VPL | 57.14% | 0.00% | Complete collapse! |
| Llama-3.1-8B | SPL | 63.71% | 97.10% | Healthy |
Cross-dataset Performance¶
| Dataset | VPL Collapse Behavior | SPL Collapse Behavior | SPL Accuracy Gain |
|---|---|---|---|
| Pets (easy) | No collapse | No collapse | +0.5% |
| UF-P-2 (medium) | Partial collapse | No collapse | +2.1% |
| UF-P-4 (complex) | Complete collapse | No collapse | +6.6% |
Key Findings¶
- VPL completely collapses on the 8B model: active units drop from 88.22% (3B) to 0.00% (8B); a stronger decoder bypasses the latent variable entirely
- SPL eliminates collapse: active units remain at 97.10% even on the 8B model with complex data
- Collapse correlates with model capacity: larger decoders are more prone to bypassing \(z\), making swap guidance more necessary
- SPL is robust to \(\beta\): VPL is highly sensitive to the KL weight \(\beta\), while SPL remains stable over a wide range
- P-IAF decomposition is effective: ablations show that removing the \(c_d/c_s\) decomposition reduces user specialization
Highlights & Insights¶
- First report of posterior collapse in preference learning: while well-known in the VAE literature, this phenomenon had not been identified in preference modeling
- Swap guidance is an elegant solution to posterior collapse — it leverages the natural symmetry of preference pairs (swapping chosen/rejected) to constrain the latent space structure
- The swap-reversible/invariant decomposition in P-IAF carries clear physical meaning in gradient space — directional preference signals are decoupled from uncertainty signals
- Adaptive modulation avoids overfitting caused by "forcing \(z\) to be used" — automatically falling back when the \(z\) signal is weak
Limitations & Future Work¶
- Evaluation scenarios are limited; the personalization effect within actual RLHF training has not been verified (only preference prediction accuracy is assessed)
- User preference types are defined using predefined categories (helpfulness, honesty, etc.), with no handling of continuous preference spectra
- The selection of P-IAF step count \(K\) and \(\lambda\) relies on manual tuning, lacking an adaptive mechanism
- No direct comparison with other personalized RLHF methods (e.g., multi-reward model aggregation) is provided
Related Work & Insights¶
- vs VPL (Poddar et al., 2024): SPL addresses the core deficiency of VPL (posterior collapse), making personalized reward models viable for larger models
- vs VAE posterior collapse literature: collapse in preference learning has a distinctive cause — the decoder already obtains sufficient information from (prompt, response) pairs and has no need for \(z\)
- vs multi-reward aggregation: SPL represents user preferences in a continuous latent space, offering greater flexibility than discrete multi-reward approaches
- Insight: The central challenge of personalized alignment is not "how to model diverse preferences" but "how to ensure preference information is encoded into the latent variable"
Rating¶
- Novelty: ⭐⭐⭐⭐ — swap-guided regularization and the P-IAF decomposition are conceptually novel
- Experimental Thoroughness: ⭐⭐⭐ — validated across multiple models, but application scenarios are limited
- Writing Quality: ⭐⭐⭐⭐ — the narrative arc from problem analysis (collapse → swap observation → method design) is clear and coherent
- Value: ⭐⭐⭐⭐ — provides a practical, theoretically grounded solution for personalized alignment