Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis¶
Conference: NeurIPS 2025 | arXiv: 2509.26074 | Code: https://github.com/deeplearning-wisc/lens | Area: LLM Alignment | Keywords: reward modeling, latent space synthesis, VAE, preference data augmentation, RLHF
TL;DR¶
This paper proposes LENS, a framework that synthesizes preference data pairs in the latent space of LLM embeddings via a VAE, bypassing costly text generation and achieving substantial improvements in reward model performance at dramatically reduced computational cost (16,000× smaller model, 18× faster generation).
Background & Motivation¶
Reward modeling is central to aligning LLMs with human preferences, yet faces a severe data bottleneck:
High annotation cost: Preference data requires pairwise human comparisons, which are time-consuming and expensive to collect at scale.
High computational overhead of text-based synthesis: Conventional approaches require a two-stage pipeline—generating multiple responses with an LLM and then labeling preference pairs with an auxiliary LLM—whose complexity grows quadratically with the number of responses.
Resource-constrained settings: Small research labs and startups often cannot afford the inference costs of billion-parameter models.
Core Problem: Given limited preference data, can one efficiently scale the dataset to improve reward modeling?
The authors observe that LLM embedding spaces already encode rich semantic information, leading to a key insight: synthesizing data directly in the embedding space can bypass the computational bottleneck of text generation.
Method¶
Overall Architecture¶
The LENS (Latent EmbeddiNg for Synthesis) framework consists of three stages:
- Embedding extraction + VAE training: A variational autoencoder with a divergence loss is trained on response embeddings.
- Latent space synthesis: Synthetic preference pairs are generated via controlled perturbations in the learned latent space.
- Augmented training: The reward model is trained on the combined original and synthetic data.
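To make the flow concrete, here is a minimal high-level sketch of the three stages in Python. All function names (`embed`, `train_vae_with_divergence`, `synthesize_in_latent_space`, `train_reward_mlp`) are placeholders for illustration, not the authors' API.

```python
# Hypothetical outline of the three LENS stages; every helper below is a placeholder.
def lens_pipeline(llm, seed_prefs, augment_ratio=8):
    # 1) Embedding extraction + VAE training with the divergence loss
    embs = [(embed(llm, x, y_pos), embed(llm, x, y_neg))
            for x, y_pos, y_neg in seed_prefs]
    vae = train_vae_with_divergence(embs)

    # 2) Latent-space synthesis of additional preference pairs
    synthetic = synthesize_in_latent_space(vae, embs, ratio=augment_ratio)

    # 3) Train a lightweight MLP reward head on original + synthetic embedding pairs
    return train_reward_mlp(embs + synthetic)
```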
Key Designs¶
Step 1: Embedding Extraction
Given a preference dataset \(\mathcal{D} = \{(x_i, y_i^+, y_i^-)\}_{i=1}^N\), the last-layer hidden states of an LLM are extracted as embeddings for the chosen and rejected responses.
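In the notation used for this summary (the paper's exact symbols may differ), an embedding function \(f\) maps each prompt-response pair to a \(d\)-dimensional vector taken from the LLM's final hidden layer:

\[
e_i^{+} = f(x_i, y_i^{+}), \qquad e_i^{-} = f(x_i, y_i^{-}), \qquad e_i^{\pm} \in \mathbb{R}^{d}.
\]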
Step 2: VAE with Divergence Learning
The VAE encoder maps the \(d\)-dimensional LLM embeddings to the parameters (mean and variance) of a Gaussian posterior over a low-dimensional latent space.
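A sketch of this mapping, assuming the usual diagonal-Gaussian parameterization (the latent dimension of 16 is the value reported in the paper):

\[
q_\phi(z \mid e) = \mathcal{N}\big(z;\ \mu_\phi(e),\ \operatorname{diag}(\sigma_\phi^2(e))\big), \qquad z \in \mathbb{R}^{16}.
\]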
The standard VAE loss combines a reconstruction term with a KL regularizer toward the prior.
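Written out in the usual (negative-ELBO) form, with \(g_\theta\) denoting the decoder, this is

\[
\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_\phi(z \mid e)}\big[-\log p_\theta(e \mid z)\big] + D_{\mathrm{KL}}\big(q_\phi(z \mid e)\,\|\,\mathcal{N}(0, I)\big),
\]

where the reconstruction term is typically instantiated as a squared error between \(e\) and \(g_\theta(z)\).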
Core Innovation (Divergence Loss): a Wasserstein distance term is introduced to maximally separate the latent distributions of positive and negative samples.
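One plausible instantiation (our reconstruction; the paper's exact formulation may differ) uses the closed-form 2-Wasserstein distance between the two diagonal-Gaussian posteriors:

\[
\mathcal{L}_{\text{div}} = -\,W_2^2\big(q_\phi(z \mid e^{+}),\ q_\phi(z \mid e^{-})\big)
= -\Big(\lVert \mu^{+} - \mu^{-} \rVert_2^2 + \lVert \sigma^{+} - \sigma^{-} \rVert_2^2\Big),
\]

so that minimizing \(\mathcal{L}_{\text{div}}\) pushes chosen and rejected latents apart.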
The total loss combines the VAE objective with the divergence term, weighted by \(\gamma\) (optimal value approximately 0.1).
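Under the notation above, the combined objective would read

\[
\mathcal{L} = \mathcal{L}_{\text{VAE}} + \gamma\, \mathcal{L}_{\text{div}},
\]

which matches the ablation below: at \(\gamma = 0\) the divergence term vanishes, while large \(\gamma\) over-separates the latent distributions.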
Step 3: Latent Space Sampling
Synthetic embeddings are generated by adding Gaussian noise to the latent vectors and decoding the perturbed latents back into the embedding space.
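In symbols (our notation), a perturbed latent and its decoded synthetic embedding would be

\[
\tilde{z} = z + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I), \qquad \tilde{e} = g_\theta(\tilde{z}),
\]

with \(\sigma^2 = 0.01\) reported as the best-performing noise scale in the ablation.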
Top-k filtering retains the highest-likelihood synthetic samples, followed by combinatorial pairing of the retained chosen-side and rejected-side samples to expand the dataset.
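The sketch below shows what one synthesis round could look like for a single seed pair, assuming `encoder`/`decoder` are the trained VAE halves; the function names, the prior-based likelihood criterion, and the exact pairing scheme are illustrative assumptions, not the authors' implementation.

```python
import torch

@torch.no_grad()
def synthesize_pairs(encoder, decoder, pos_emb, neg_emb,
                     sigma2=0.01, n_samples=16, k=4):
    """Illustrative latent-space synthesis for one (chosen, rejected) seed pair."""
    synthetic = {}
    for tag, emb in (("pos", pos_emb), ("neg", neg_emb)):
        mu, logvar = encoder(emb)                            # posterior parameters
        std = (0.5 * logvar).exp()
        z = mu + torch.randn(n_samples, mu.shape[-1]) * std  # sample latents
        z_tilde = z + torch.randn_like(z) * sigma2 ** 0.5    # Gaussian perturbation
        # Rank candidates by likelihood under the standard-normal prior (assumed criterion)
        log_prior = -0.5 * (z_tilde ** 2).sum(dim=-1)
        keep = log_prior.topk(k).indices                     # top-k filtering
        synthetic[tag] = decoder(z_tilde[keep])              # decode back to embeddings
    # Combinatorial pairing: every synthetic chosen with every synthetic rejected
    return [(p, n) for p in synthetic["pos"] for n in synthetic["neg"]]
```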
Loss & Training¶
The reward model is a lightweight MLP trained directly in embedding space, on both the original and the synthetic pairs, with a Bradley-Terry objective.
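The Bradley-Terry loss takes its standard pairwise form (written in our notation, with \(r_\psi\) the MLP reward head over embeddings):

\[
\mathcal{L}_{\text{RM}} = -\,\mathbb{E}_{(e^{+},\, e^{-})}\big[\log \sigma\big(r_\psi(e^{+}) - r_\psi(e^{-})\big)\big],
\]

where the expectation runs over both original and synthetic preference pairs and \(\sigma\) is the logistic sigmoid.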
Theoretical Guarantees:
- Theorem 1: Synthetic preference pairs preserve the original preference ordering under the optimal reward function, with error bounded by the noise level and VAE reconstruction quality.
- Theorem 2: Reward models trained on augmented data achieve a tighter estimation-error upper bound under standard regularity conditions.
Key Experimental Results¶
Main Results¶
Base model: Llama-3.1-8B-Instruct; 1,000 seed samples; evaluation metric: gold reward score under Best-of-N (\(N=16\)) sampling.
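Concretely, the Best-of-N metric can be read as follows (our formalization of the protocol described above): for each prompt, the trained reward model picks its highest-scoring response among \(N = 16\) candidates, and the reported number is the gold reward of that selection,

\[
\text{BoN}(x) = R_{\text{gold}}\Big(x,\ \arg\max_{y \in \{y_1, \dots, y_{16}\}} r(x, y)\Big).
\]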
| Method | HH-RLHF (Orig) | HH-RLHF (4×) | HH-RLHF (8×) | TL;DR (Orig) | TL;DR (4×) | TL;DR (8×) |
|---|---|---|---|---|---|---|
| Fully fine-tune (text) | 1.49 | 1.78 | 1.93 | 0.69 | 0.97 | 1.23 |
| LoRA (text) | 1.28 | 1.52 | 1.61 | 0.57 | 0.92 | 1.15 |
| Embedding MLP (text) | 1.43 | 1.62 | 1.73 | 0.78 | 1.02 | 1.11 |
| Self-rewarding | 1.49 | 1.59 | 1.77 | 0.69 | 0.92 | 0.95 |
| Direct perturbation | 1.43 | 1.32 | 1.46 | 0.78 | 0.84 | 0.79 |
| Gaussian sampling | 1.43 | 1.12 | 0.94 | 0.78 | 0.53 | 0.43 |
| LENS (Ours) | 1.43 | 1.94 | 2.20 | 0.78 | 1.44 | 1.48 |
With augmentation, LENS outperforms all baselines across ratios and datasets. At 8× augmentation on HH-RLHF, it surpasses the strongest text-based baseline by 0.27 points.
Computational Efficiency (8× augmentation, HH-RLHF):
| Metric | Text Synthesis | Latent Space Synthesis | Reduction |
|---|---|---|---|
| Generation time | 3.6h | 0.2h | 18× |
| Model parameters | 8B | 0.5M | 16,000× |
| Total runtime | 5.2h | 0.4h | 13× |
Ablation Study¶
- Divergence loss weight \(\gamma\): \(\gamma=0.1\) is optimal; \(\gamma=0\) (no divergence) degrades performance; \(\gamma \geq 0.5\) causes over-separation and collapse.
- Synthesis noise \(\sigma^2\): \(\sigma^2=0.01\) is optimal (reward=1.96); too small (\(0.001\to1.63\)) yields insufficient exploration; too large (\(1.0\to1.51\)) corrupts preference relationships.
- Seed data scale: LENS at 4× augmentation consistently outperforms unaugmented data across seed sizes from 0.1k to 50k samples (0.1k: 0.93 vs. 0.68).
- Cross-model generalization: LENS is effective across Gemma-2B, Llama-3.2-3B, Mistral-7B, and Qwen-2.5-7B.
Key Findings¶
- Even without augmentation, an embedding-based MLP reward model approaches the performance of full fine-tuning, demonstrating that LLM embeddings already encode rich preference information.
- Naive latent-space baselines (direct perturbation, Gaussian sampling) perform poorly or even degrade, underscoring the importance of the structured latent representation learned by the VAE.
- In downstream SFT via rejection sampling, the LENS-trained reward model achieves a 61% win rate under GPT-4 evaluation against its text-synthesis-based counterpart (which wins 39%).
Highlights & Insights¶
- Remarkable efficiency gains: A 16,000× reduction in model size and 18× speedup practically resolve the computational bottleneck.
- Dual validation through theory and experiments: Theoretical guarantees on preference preservation are complemented by extensive empirical evidence.
- The latent space is the "right" domain for preference data manipulation: The strategy of bypassing text generation is elegant and offers a transferable principle for other data augmentation tasks.
- Divergence loss design: Explicitly separating the latent distributions of positive and negative samples via Wasserstein distance is better suited to preference learning than a standard VAE objective.
Limitations & Future Work¶
- Only MLP reward heads are evaluated: The combination of latent-space augmentation with fully fine-tuned reward models has not been tested.
- Dependence on seed data quality: If the initial 1,000 samples are not sufficiently representative, synthesis quality may be limited.
- Fixed VAE latent dimension of 16: The effect of different dimensionalities on models of varying scales remains unexplored.
- Evaluation is restricted to English preference data: Generalization to multilingual settings is unknown.
- No comparison with direct preference optimization methods such as DPO: Whether LENS-augmented data also benefits DPO training has not been investigated.
Related Work & Insights¶
- Comparison with Self-Rewarding: Self-Rewarding still incurs text generation overhead by relying on the model to generate and judge its own outputs; LENS operates entirely in embedding space.
- Connection to VOS (Du et al.): VOS synthesizes virtual samples directly in feature space for OOD detection; the same research group transfers that feature-space synthesis idea to preference learning here.
- Complementarity with Active Learning: LENS is a data augmentation approach and can be combined with active learning strategies for selective annotation.
- Broader implications: The framework is extensible to other tasks requiring pairwise comparison data, such as contrastive learning and learning to rank.
Rating¶
- Novelty: ⭐⭐⭐⭐ The idea of synthesizing preference data in latent space is novel, and the divergence VAE design is well-motivated.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multiple datasets, models, and ablations, complemented by downstream SFT validation and theoretical analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, detailed method description, and rich figures and tables.
- Value: ⭐⭐⭐⭐ Practically addresses the computational cost problem and offers significant value for resource-constrained settings.