# Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning
**Conference:** ICLR 2026 · **arXiv:** 2602.11779 · **Code:** None · **Area:** LLM Reasoning · **Keywords:** Temperature Scheduling, Meta-Policy, GRPO, Adaptive Exploration, Mathematical Reasoning
## TL;DR
This paper proposes TAMPO (Temperature Adaptive Meta Policy Optimization), which reframes the sampling temperature as a learnable meta-policy. Through a bilevel loop — an inner loop for LLM policy optimization and an outer loop for adaptively updating the temperature distribution based on trajectory advantage signals — TAMPO requires no additional rollouts and consistently outperforms fixed-temperature baselines on mathematical reasoning benchmarks.
## Background & Motivation
- Temperature is the core parameter governing the exploration–exploitation trade-off in LLM sampling.
- High temperature encourages diversity but introduces noise; low temperature improves focus but risks premature convergence.
- Existing RL training methods (e.g., GRPO) treat temperature as a fixed hyperparameter, ignoring the dynamic needs across training.
- Although entropy regularization and KL penalties also affect exploration, temperature directly modulates the sampling distribution and is more transparent and controllable.
- Core argument: Temperature should be a learnable decision variable rather than a manually tuned hyperparameter.
## Method
### Overall Architecture
TAMPO adopts a hierarchical bilevel loop structure:
- Inner loop: Generates rollouts using a selected temperature \(T_s\) and updates the LLM policy \(\pi_\theta\) via GRPO.
- Outer loop: Reuses inner-loop rollouts to update the temperature meta-policy \(\pi(T)\) based on trajectory advantage signals.
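To make the two loops concrete, below is a minimal Python sketch of one plausible orchestration, not the authors' implementation: `sample_rollouts` and `grpo_update` are hypothetical stand-ins for standard GRPO machinery, the candidate temperature grid is an assumed example, and `avg_log_likelihood`, `temp_specific_advantages`, `update_meta_policy`, and `sample_top_p` are sketched in the subsections that follow.

```python
import torch

def tampo_train(policy, prompt_loader, num_steps: int):
    """Illustrative TAMPO bilevel loop; helper names are hypothetical."""
    temps = torch.tensor([0.6, 0.9, 1.2, 1.5])  # candidate set T (assumed grid)
    K = len(temps)
    pi_T = torch.full((K,), 1.0 / K)            # meta-policy pi(T), uniform init
    ema_adv = torch.zeros(K)                    # EMA of temperature advantages

    for _, prompts in zip(range(num_steps), prompt_loader):
        # Outer-loop decision: sample this step's rollout temperature T_s.
        T_s = sample_top_p(temps, pi_T, p=0.7)

        # Inner loop: generate rollouts at T_s, update the LLM policy via GRPO.
        rollouts = sample_rollouts(policy, prompts, temperature=T_s)
        policy = grpo_update(policy, rollouts)

        # Outer loop: reuse the same rollouts (no extra generation) to score
        # every candidate temperature, then refresh the meta-policy.
        rows = []
        for r in rollouts:  # each r carries .logits, .tokens, .advantage
            ells = torch.stack(
                [avg_log_likelihood(r.logits, r.tokens, float(T)) for T in temps]
            )
            rows.append(temp_specific_advantages(ells, r.advantage))
        ema_adv, pi_T = update_meta_policy(ema_adv, torch.stack(rows))
    return policy
```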
### Key Designs
Each trajectory implicitly encodes a "preferred temperature", i.e., the temperature under which it is most likely to have been generated: \(T^*(\tau_i) = \arg\max_{T \in \mathcal{T}} \ell_T(\tau_i)\),
where \(\ell_T(\tau_i) = \frac{1}{|\tau_i|} \sum_{t=1}^{|\tau_i|} \log \pi_{\theta,T}(o_{i,t} \mid s_{i,t})\) denotes the average log-likelihood under the temperature-scaled policy \(\pi_{\theta,T}\) (the softmax of the logits divided by \(T\)).
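Evaluating \(\ell_T\) requires only the per-token logits from a single forward pass over the already-generated trajectory; rescaling those cached logits by each candidate \(T_k\) is what makes the candidates "virtual". A minimal PyTorch sketch (shapes and names are assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def avg_log_likelihood(logits: torch.Tensor, tokens: torch.Tensor, T: float) -> torch.Tensor:
    """ell_T(tau): average log-likelihood of a trajectory under temperature T.

    logits: [seq_len, vocab] pre-softmax scores from one forward pass
    tokens: [seq_len]        the tokens actually generated in the trajectory
    """
    log_probs = F.log_softmax(logits / T, dim=-1)                  # pi_{theta,T}
    token_lp = log_probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean()
```

Because the same cached logits serve all \(K\) candidates, the outer loop adds no rollouts.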
### Temperature-Specific Advantage
For each trajectory \(\tau_i\) and virtual candidate temperature \(T_k\):
- Compute \(\ell_{T_k}(\tau_i)\), the average log-likelihood of the trajectory under temperature \(T_k\), from the cached rollout logits.
- Apply sparsemax normalization to obtain \(\hat{\ell}_{T_k}(\tau_i)\) (summing to 1 across \(K\) candidate temperatures).
- Temperature-specific advantage: \(\mathcal{A}_i^{(T_k)} = \hat{\ell}_{T_k}(\tau_i) \cdot A_i\), where \(A_i\) is the trajectory's GRPO advantage.
Intuition:
- Trajectories with positive advantage → reinforce their most likely generating temperature.
- Trajectories with negative advantage → penalize their most likely generating temperature.
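A sketch of this computation, assuming the standard sparsemax of Martins & Astudillo (2016), i.e. a Euclidean projection of the \(K\) likelihood scores onto the probability simplex. Its sparsity zeroes out temperatures that are clearly mismatched with a trajectory, so credit and blame concentrate on the plausible generating temperatures:

```python
import torch

def sparsemax(z: torch.Tensor) -> torch.Tensor:
    """Sparsemax of a 1-D tensor: Euclidean projection onto the simplex."""
    z_sorted, _ = torch.sort(z, descending=True)
    k = torch.arange(1, z.numel() + 1, dtype=z.dtype, device=z.device)
    cumsum = torch.cumsum(z_sorted, dim=0)
    support = 1 + k * z_sorted > cumsum        # prefix of coordinates kept active
    k_z = support.nonzero().max() + 1          # support size
    tau = (cumsum[k_z - 1] - 1) / k_z          # threshold
    return torch.clamp(z - tau, min=0.0)

def temp_specific_advantages(ells: torch.Tensor, advantage: float) -> torch.Tensor:
    """A_i^{(T_k)}: sparsemax-normalized likelihoods times the GRPO advantage."""
    ell_hat = sparsemax(ells)                  # [K], sums to 1
    return ell_hat * advantage
```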
### Meta-Policy Update
- Batch aggregation: \(\mathcal{A}_\mathcal{B}^{(T_k)} = \frac{1}{|\mathcal{B}|\,G} \sum_{b=1}^{|\mathcal{B}|} \sum_{i=1}^{G} \mathcal{A}_{b,i}^{(T_k)}\), averaged over all \(G\) responses to each of the \(|\mathcal{B}|\) prompts.
- EMA smoothing: \(\bar{\mathcal{A}}_s^{(T_k)} = (1-\alpha)\bar{\mathcal{A}}_{s-1}^{(T_k)} + \alpha \mathcal{A}_\mathcal{B}^{(T_k)}\)
- Min-max normalization, \(\tilde{\mathcal{A}}_s^{(T_k)} = \frac{\bar{\mathcal{A}}_s^{(T_k)} - \min_j \bar{\mathcal{A}}_s^{(T_j)}}{\max_j \bar{\mathcal{A}}_s^{(T_j)} - \min_j \bar{\mathcal{A}}_s^{(T_j)}}\), followed by renormalization into a probability distribution: \(\pi_s(T_k) = \frac{\tilde{\mathcal{A}}_s^{(T_k)}}{\sum_j \tilde{\mathcal{A}}_s^{(T_j)}}\)
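The three steps compose into a single outer-loop update. A sketch under the same assumptions as above (the `eps` guards are mine, added for numerical safety):

```python
import torch

def update_meta_policy(ema_adv: torch.Tensor, adv_matrix: torch.Tensor,
                       alpha: float = 0.05, eps: float = 1e-8):
    """One outer-loop step: batch aggregation, EMA smoothing, min-max normalization.

    ema_adv:    [K]    running EMA of temperature advantages
    adv_matrix: [N, K] temperature-specific advantages for all N = |B|*G trajectories
    """
    batch_adv = adv_matrix.mean(dim=0)                     # A_B^{(T_k)}
    ema_adv = (1 - alpha) * ema_adv + alpha * batch_adv    # EMA smoothing
    lo, hi = ema_adv.min(), ema_adv.max()
    scaled = (ema_adv - lo) / (hi - lo + eps)              # min-max normalization
    pi_T = scaled / (scaled.sum() + eps)                   # pi_s(T_k), sums to 1
    return ema_adv, pi_T
```

The default \(\alpha = 0.05\) here matches the best-performing EMA coefficient in the ablation below.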
### Temperature Sampling
Temperature is sampled from the meta-policy via nucleus sampling (top-p), with \(p=0.7\) providing the best exploration–exploitation balance.
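Over a \(K\)-way categorical meta-policy, top-p sampling reduces to keeping the smallest set of temperatures whose probability mass reaches \(p\) and renormalizing. A sketch consistent with the loop above (hypothetical helper, not the paper's code):

```python
import torch

def sample_top_p(temps: torch.Tensor, pi_T: torch.Tensor, p: float = 0.7) -> float:
    """Nucleus (top-p) sampling over the K candidate temperatures."""
    probs, order = torch.sort(pi_T, descending=True)
    keep = torch.cumsum(probs, dim=0) - probs < p   # smallest set with mass >= p
    keep[0] = True                                  # always keep the top candidate
    nucleus = probs * keep
    nucleus = nucleus / nucleus.sum()               # renormalize within the nucleus
    idx = order[torch.multinomial(nucleus, 1)]
    return float(temps[idx])
```

With \(p = 0\), only the top candidate survives, recovering the greedy variant evaluated in the ablation.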
### Design Characteristics
- Zero additional rollouts: Fully reuses trajectory data from the inner loop.
- Non-differentiable optimization: gradients cannot flow from the reward back to the sampling temperature; TAMPO sidesteps this by reading the learning signal off trajectory likelihoods instead.
- Negligible overhead: The meta-policy maintains only a list of temperature advantages and is discarded at inference time.
## Key Experimental Results
### Main Results: Mathematical Reasoning Benchmarks (DS-Qwen-1.5B, Pass@1 %)
| Method | Average | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|
| DS-Qwen-1.5B (no RL) | 39.1 | 13.3 | 76.2 | 45.0 | 22.8 | 38.4 |
| GRPO (\(T_s\):0.9) | 42.0 | 20.0 | 75.2 | 50.0 | 26.1 | 38.7 |
| GRPO (\(T_s\):1.5) | 42.6 | 23.3 | 75.4 | 52.5 | 22.8 | 39.0 |
| GRPO (\(T_s\):0.9→1.5) | 42.8 | 16.7 | 76.6 | 55.0 | 24.6 | 41.0 |
| TAMPO | 44.5 | 23.3 | 76.8 | 55.0 | 27.9 | 39.6 |
### Ablation Study: EMA Coefficient \(\alpha\)
| \(\alpha\) | Average | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|
| 0.01 | 41.6 | 20.0 | 75.2 | 50.0 | 25.4 | 37.5 |
| 0.05 | 44.5 | 23.3 | 76.8 | 55.0 | 27.9 | 39.6 |
| 0.10 | 43.6 | 23.3 | 75.4 | 57.5 | 23.2 | 38.8 |
### Ablation Study: Meta-Policy Sampling Strategy
| top-p | Average |
|---|---|
| 0.9 | 43.0 |
| 0.7 | 44.5 |
| 0.5 | 42.2 |
| 0 (greedy) | 40.9 |
### Cross-Task Generalization (Qwen2.5-3B-Instruct → ECQA)
| Method | Pass@1 | Pass@8 |
|---|---|---|
| No RL | 73.06% | 77.76% |
| GRPO | 75.07% | 78.94% |
| TAMPO | 76.12% | 79.67% |
### Key Findings
- TAMPO outperforms the best fixed-temperature baseline by +1.9% (Pass@1) and +1.7% (Pass@8) on average.
- After warmup, the learned meta-policy favors higher temperatures (~1.3) to encourage exploration, then gradually shifts toward lower temperatures as training proceeds.
- Greedy sampling (\(p=0\)) yields the worst results, indicating that exploration of the temperature space itself requires stochasticity.
- Training time is identical to the baseline (~9h54min on 8×V100).
- The method generalizes effectively to commonsense reasoning tasks.
## Highlights & Insights
- Elevating temperature from a hyperparameter to a decision variable: a novel problem formulation.
- No additional rollouts required: Virtual temperature likelihood computation cleverly reuses existing trajectory data.
- Learned temperature policy aligns with intuition: A high-to-low exploration–exploitation transition emerges naturally.
- Fully compatible with existing RL methods: Can be plugged into GRPO, DAPO, REINFORCE++, and other critic-free approaches.
- Negligible computational overhead: Only \(K\) temperature advantage estimates need to be maintained.
## Limitations & Future Work
- The candidate temperature set \(\mathcal{T}\) still requires manual specification of range and granularity.
- The unimodal property of trajectory likelihood with respect to temperature may not hold in all settings.
- Main experiments are conducted only on 1.5B models; validation on larger models is insufficient.
- The temperature meta-policy is shared across prompts; prompt-level adaptation remains unexplored.
## Related Work & Insights
- Critic-free RL: GRPO, DAPO, REINFORCE++
- Exploration–exploitation: ε-greedy, temperature annealing, UCB, entropy regularization
- Meta-policy: MLSH (hierarchical RL), Meta-SAC (automatic entropy coefficient)
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — The formulation of temperature as a meta-policy is original.
- Technical Depth: ⭐⭐⭐⭐ — Theoretical derivations are clear, though the method itself is concise.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Five benchmarks with comprehensive ablations, but limited model scale.
- Practicality: ⭐⭐⭐⭐⭐ — Improves RL training with zero additional computational cost.