Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis

Conference: NeurIPS 2025 arXiv: 2509.26074 Code: https://github.com/deeplearning-wisc/lens Area: LLM Alignment Keywords: reward modeling, latent space synthesis, VAE, preference data augmentation, RLHF

TL;DR

This paper proposes LENS, a framework that synthesizes preference data pairs in the latent space of LLM embeddings via a VAE, bypassing costly text generation and achieving substantial improvements in reward model performance at dramatically reduced computational cost (16,000× smaller model, 18× faster generation).

Background & Motivation

Reward modeling is central to aligning LLMs with human preferences, yet faces a severe data bottleneck:

High annotation cost: Preference data requires pairwise human comparisons, which are time-consuming and expensive to collect at scale.

High computational overhead of text-based synthesis: Conventional approaches require a two-stage pipeline—generating multiple responses with an LLM and then labeling preference pairs with an auxiliary LLM—whose complexity grows quadratically with the number of responses.

Resource-constrained settings: Small research labs and startups often cannot afford the inference costs of billion-parameter models.

Core Problem: Given limited preference data, can one efficiently scale the dataset to improve reward modeling?

The authors observe that LLM embedding spaces already encode rich semantic information, leading to a key insight: synthesizing data directly in the embedding space can bypass the computational bottleneck of text generation.

Method

Overall Architecture

The LENS (Latent EmbeddiNg for Synthesis) framework consists of three stages:

  1. Embedding extraction + VAE training: A variational autoencoder with a divergence loss is trained on response embeddings.
  2. Latent space synthesis: Synthetic preference pairs are generated via controlled perturbations in the learned latent space.
  3. Augmented training: The reward model is trained on the combined original and synthetic data.

Key Designs

Step 1: Embedding Extraction

Given a preference dataset \(\mathcal{D} = \{(x_i, y_i^+, y_i^-)\}_{i=1}^N\), the last-layer hidden states of an LLM are extracted as embeddings:

\[\mathbf{e}_i^{\pm} = \text{LLM}_{\text{embed}}(x_i, y_i^{\pm})\]
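As a minimal sketch of this step using Hugging Face transformers: pooling by taking the final token's last-layer hidden state is an assumption on my part; the paper only specifies last-layer hidden states as embeddings.

```python
# Sketch of embedding extraction with Hugging Face transformers. Pooling by
# the final token's last-layer hidden state is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # base model used in the experiments
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()

@torch.no_grad()
def embed(prompt: str, response: str) -> torch.Tensor:
    """Return a d-dimensional embedding e for the pair (x, y)."""
    inputs = tokenizer(prompt + response, return_tensors="pt")
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
    return hidden[0, -1]  # last-layer hidden state of the final token
```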

Step 2: VAE with Divergence Learning

The VAE encoder maps the \(d\)-dimensional LLM embeddings to Gaussian posterior parameters:

\[q_\phi(\mathbf{z}|\mathbf{e}^+) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}_\phi(\mathbf{e}^+), \boldsymbol{\sigma}_\phi(\mathbf{e}^+)^2 \cdot \mathbf{I})\]

The standard VAE loss is:

\[\mathcal{L}_{\text{VAE}}(\mathbf{e}) = \mathcal{L}_{\text{recon}}(\mathbf{e}, \hat{\mathbf{e}}) + \beta \cdot D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{e}) \| p(\mathbf{z}))\]
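A minimal sketch of this component, assuming MLP encoder/decoder layers and the 16-dimensional latent space reported in the paper; the hidden width (512) and the MSE reconstruction term are assumptions.

```python
# Minimal beta-VAE sketch over LLM embeddings (architecture details assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentVAE(nn.Module):
    def __init__(self, d_embed: int, d_latent: int = 16, d_hidden: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_embed, d_hidden), nn.ReLU())
        self.mu_head = nn.Linear(d_hidden, d_latent)
        self.logvar_head = nn.Linear(d_hidden, d_latent)
        self.decoder = nn.Sequential(nn.Linear(d_latent, d_hidden), nn.ReLU(),
                                     nn.Linear(d_hidden, d_embed))

    def encode(self, e):
        h = self.encoder(e)
        return self.mu_head(h), self.logvar_head(h)

    def forward(self, e):
        mu, logvar = self.encode(e)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(e, e_hat, mu, logvar, beta: float = 1.0):
    """Reconstruction + beta-weighted KL to the standard-normal prior."""
    recon = F.mse_loss(e_hat, e)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
    return recon + beta * kl
```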

Core Innovation — Divergence Loss: A Wasserstein distance term is introduced to maximally separate the latent distributions of positive and negative samples:

\[\mathcal{L}_{\text{divergence}} = -\frac{1}{N}\sum_{i=1}^{N} W_2(q_\phi(\mathbf{z}^+|\mathbf{e}_i^+), q_\phi(\mathbf{z}^-|\mathbf{e}_i^-))\]

The total loss is:

\[\mathcal{L}_{\text{total}} = \frac{1}{N}\sum_{i=1}^{N}[\mathcal{L}_{\text{VAE}}(\mathbf{e}_i^+) + \mathcal{L}_{\text{VAE}}(\mathbf{e}_i^-)] + \gamma \cdot \mathcal{L}_{\text{divergence}}\]

where \(\gamma\) controls the weight of the divergence term (optimal value approximately 0.1).
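Because both posteriors are diagonal Gaussians, the 2-Wasserstein distance has a closed form, \(W_2^2 = \|\boldsymbol{\mu}^+ - \boldsymbol{\mu}^-\|^2 + \|\boldsymbol{\sigma}^+ - \boldsymbol{\sigma}^-\|^2\). The sketch below, continuing the VAE sketch above, shows one way to combine the divergence term with the per-pair VAE losses; the default \(\beta\) is an assumption, while \(\gamma = 0.1\) follows the reported optimum.

```python
# Sketch of the divergence and total losses, reusing LatentVAE / vae_loss
# from the sketch above.
import torch

def w2_diag_gaussian(mu_p, logvar_p, mu_n, logvar_n):
    """Closed-form W2 distance between two diagonal Gaussian posteriors."""
    sigma_p, sigma_n = (0.5 * logvar_p).exp(), (0.5 * logvar_n).exp()
    w2_sq = (mu_p - mu_n).pow(2).sum(-1) + (sigma_p - sigma_n).pow(2).sum(-1)
    return w2_sq.sqrt()

def lens_total_loss(vae, e_pos, e_neg, beta: float = 1.0, gamma: float = 0.1):
    """VAE losses on both responses plus the negated Wasserstein separation term."""
    e_pos_hat, mu_p, lv_p = vae(e_pos)
    e_neg_hat, mu_n, lv_n = vae(e_neg)
    l_vae = (vae_loss(e_pos, e_pos_hat, mu_p, lv_p, beta)
             + vae_loss(e_neg, e_neg_hat, mu_n, lv_n, beta))
    l_div = -w2_diag_gaussian(mu_p, lv_p, mu_n, lv_n).mean()  # maximize separation
    return l_vae + gamma * l_div
```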

Step 3: Latent Space Sampling

Synthetic embeddings are generated by adding Gaussian noise to latent vectors:

\[\hat{\mathbf{e}}_{i,j}^{\pm} = g_\theta(\mathbf{z}_i^{\pm} + \boldsymbol{\eta}_{i,j}^{\pm}), \quad \boldsymbol{\eta}_{i,j}^{\pm} \sim \mathcal{N}(0, \sigma_{\text{noise}}^2\mathbf{I})\]

Top-k filtering retains the highest-likelihood synthetic samples, followed by combinatorial pairing to expand the dataset:

\[\mathcal{E}_{\text{aug}} = \{(\tilde{\mathbf{e}}^+, \tilde{\mathbf{e}}^-) | \tilde{\mathbf{e}}^+ \in \mathcal{E}^+ \cup \mathcal{E}_{\text{synth}}^+, \tilde{\mathbf{e}}^- \in \mathcal{E}^- \cup \mathcal{E}_{\text{synth}}^-\}\]
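A sketch of the synthesis step, continuing the sketches above. The number of candidates per seed, the top-k value, and scoring candidates by the standard-normal prior density for filtering are assumptions, not details taken from the paper.

```python
# Latent-space synthesis sketch: perturb, filter by prior likelihood, decode, pair.
import itertools
import torch

@torch.no_grad()
def synthesize(vae, e, n_candidates: int = 8, k: int = 4, sigma_noise: float = 0.1):
    """Generate k filtered synthetic embeddings around one embedding e of shape (d,)."""
    mu, logvar = vae.encode(e)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    noise = sigma_noise * torch.randn(n_candidates, z.shape[-1])
    z_pert = z.unsqueeze(0) + noise                 # perturbed latents
    scores = -z_pert.pow(2).sum(-1)                 # log N(z; 0, I) up to a constant
    keep = scores.topk(k).indices                   # top-k filtering
    return vae.decoder(z_pert[keep])                # decode back to embedding space

def pair_up(pos_embeddings, neg_embeddings):
    """Combinatorial pairing of (chosen, rejected) embeddings for one prompt."""
    return list(itertools.product(pos_embeddings, neg_embeddings))
```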

Loss & Training

The reward model is a lightweight MLP head trained directly in embedding space with a Bradley-Terry objective:

\[\mathcal{L}_{RM}^{\mathcal{E}_{\text{aug}}} = -\mathbb{E}_{(\tilde{\mathbf{e}}^+, \tilde{\mathbf{e}}^-) \in \mathcal{E}_{\text{aug}}}[\log\sigma(r_o(\tilde{\mathbf{e}}^+) - r_o(\tilde{\mathbf{e}}^-))]\]
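A minimal sketch of this reward head and loss; the hidden width is a placeholder (the paper reports roughly 0.5M parameters for the reward model).

```python
# Lightweight reward head and Bradley-Terry loss on embedding pairs (sizes assumed).
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    def __init__(self, d_embed: int, d_hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_embed, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, 1))

    def forward(self, e):
        return self.net(e).squeeze(-1)  # scalar reward r_o(e)

def bt_loss(reward_head: RewardHead, e_pos, e_neg):
    """Negative log-likelihood under the Bradley-Terry preference model."""
    return -F.logsigmoid(reward_head(e_pos) - reward_head(e_neg)).mean()
```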

Theoretical Guarantees:

  • Theorem 1: Synthetic preference pairs preserve the original preference ordering under the optimal reward function, with error bounded by the noise level and VAE reconstruction quality.
  • Theorem 2: Reward models trained on augmented data achieve a tighter estimation error upper bound under standard regularity conditions.

Key Experimental Results

Main Results

Base model: Llama-3.1-8B-Instruct; 1,000 seed samples; evaluation metric: gold reward score under Best-of-N (\(N=16\)) sampling.
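As a rough sketch of this evaluation protocol (all callables below are illustrative placeholders, not interfaces from the paper's code): the learned reward model scores N sampled responses per prompt, the top-scoring one is kept, and that selection is scored by a gold reward model.

```python
# Best-of-N evaluation sketch; sample_responses, reward_model, and gold_model
# are hypothetical placeholders.
def best_of_n_gold_score(prompt, sample_responses, reward_model, gold_model, n=16):
    """Pick the response the learned RM scores highest; report its gold reward."""
    responses = sample_responses(prompt, n)                       # N candidates
    best = max(responses, key=lambda r: reward_model(prompt, r))  # select by learned reward
    return gold_model(prompt, best)                               # score with gold RM
```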

| Method | HH-RLHF (Orig) | HH-RLHF (4×) | HH-RLHF (8×) | TL;DR (Orig) | TL;DR (4×) | TL;DR (8×) |
|---|---|---|---|---|---|---|
| Fully fine-tune (text) | 1.49 | 1.78 | 1.93 | 0.69 | 0.97 | 1.23 |
| LoRA (text) | 1.28 | 1.52 | 1.61 | 0.57 | 0.92 | 1.15 |
| Embedding MLP (text) | 1.43 | 1.62 | 1.73 | 0.78 | 1.02 | 1.11 |
| Self-rewarding | 1.49 | 1.59 | 1.77 | 0.69 | 0.92 | 0.95 |
| Direct perturbation | 1.43 | 1.32 | 1.46 | 0.78 | 0.84 | 0.79 |
| Gaussian sampling | 1.43 | 1.12 | 0.94 | 0.78 | 0.53 | 0.43 |
| LENS (Ours) | 1.43 | 1.94 | 2.20 | 0.78 | 1.44 | 1.48 |

LENS outperforms all baselines across augmentation ratios and datasets. At 8× augmentation on HH-RLHF, it surpasses the strongest text-based baseline by 0.27 points.

Computational Efficiency (8× augmentation, HH-RLHF):

| Metric | Text Synthesis | Latent Space Synthesis | Reduction |
|---|---|---|---|
| Generation time | 3.6 h | 0.2 h | 18× |
| Model parameters | 8B | 0.5M | 16,000× |
| Total runtime | 5.2 h | 0.4 h | 13× |

Ablation Study

  1. Divergence loss weight \(\gamma\): \(\gamma=0.1\) is optimal; \(\gamma=0\) (no divergence) degrades performance; \(\gamma \geq 0.5\) causes over-separation and collapse.
  2. Synthesis noise \(\sigma^2\): \(\sigma^2=0.01\) is optimal (reward 1.96); too small a value (\(\sigma^2=0.001\), reward 1.63) yields insufficient exploration, while too large a value (\(\sigma^2=1.0\), reward 1.51) corrupts preference relationships.
  3. Seed data scale: LENS at 4× augmentation consistently outperforms unaugmented data across seed sizes from 0.1k to 50k samples (0.1k: 0.93 vs. 0.68).
  4. Cross-model generalization: LENS is effective across Gemma-2B, Llama-3.2-3B, Mistral-7B, and Qwen-2.5-7B.

Key Findings

  • Even without augmentation, an embedding-based MLP reward model approaches the performance of full fine-tuning, demonstrating that LLM embeddings already encode rich preference information.
  • Naive latent-space baselines (direct perturbation, Gaussian sampling) perform poorly or even degrade, underscoring the importance of the structured latent representation learned by the VAE.
  • In downstream SFT via rejection sampling, the LENS-trained reward model yields a 61% win rate under GPT-4 evaluation against the counterpart trained with text-synthesized data (which wins 39%).

Highlights & Insights

  1. Remarkable efficiency gains: A 16,000× reduction in model size and 18× speedup practically resolve the computational bottleneck.
  2. Dual validation through theory and experiments: Theoretical guarantees on preference preservation are complemented by extensive empirical evidence.
  3. The latent space is the "right" domain for preference data manipulation: The strategy of bypassing text generation is elegant and offers a transferable principle for other data augmentation tasks.
  4. Divergence loss design: Explicitly separating the latent distributions of positive and negative samples via Wasserstein distance is better suited to preference learning than a standard VAE objective.

Limitations & Future Work

  1. Only MLP reward heads are evaluated: The combination of latent-space augmentation with fully fine-tuned reward models has not been tested.
  2. Dependence on seed data quality: If the initial 1,000 samples are not sufficiently representative, synthesis quality may be limited.
  3. Fixed VAE latent dimension of 16: The effect of different dimensionalities on models of varying scales remains unexplored.
  4. Evaluation is restricted to English preference data: Generalization to multilingual settings is unknown.
  5. No comparison with direct preference optimization methods such as DPO: Whether LENS-augmented data also benefits DPO training has not been investigated.

Discussion & Connections

  • Comparison with Self-Rewarding: Self-Rewarding still incurs text generation overhead by relying on the model to generate and judge its own outputs; LENS operates entirely in embedding space.
  • Connection to VOS (Du et al.): The idea of synthesizing virtual samples directly in feature space for OOD detection, from the same research group, is transferred here to preference learning.
  • Complementarity with Active Learning: LENS is a data augmentation approach and can be combined with active learning strategies for selective annotation.
  • Broader implications: The framework is extensible to other tasks requiring pairwise comparison data, such as contrastive learning and learning to rank.

Rating

  • Novelty: ⭐⭐⭐⭐ The idea of synthesizing preference data in latent space is novel, and the divergence VAE design is well-motivated.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multiple datasets, models, and ablations, complemented by downstream SFT validation and theoretical analysis.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure, detailed method description, and rich figures and tables.
  • Value: ⭐⭐⭐⭐ Practically addresses the computational cost problem and offers significant value for resource-constrained settings.