Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning

Conference: ICLR 2026
arXiv: 2602.11779
Code: None
Area: LLM Reasoning
Keywords: Temperature Scheduling, Meta-Policy, GRPO, Adaptive Exploration, Mathematical Reasoning

TL;DR

This paper proposes TAMPO (Temperature Adaptive Meta Policy Optimization), which reframes the sampling temperature as a learnable meta-policy. TAMPO runs a bilevel loop: an inner loop optimizes the LLM policy, and an outer loop adaptively updates the temperature distribution from trajectory advantage signals. It requires no additional rollouts and outperforms fixed-temperature baselines on average across mathematical reasoning benchmarks.

Background & Motivation

Temperature is the core parameter governing the exploration–exploitation trade-off in LLM sampling: high temperature encourages diversity but introduces noise, while low temperature improves focus but risks premature convergence.
Existing RL training methods (e.g., GRPO) treat temperature as a fixed hyperparameter, ignoring how exploration needs change over the course of training.
Although entropy regularization and KL penalties also affect exploration, temperature directly modulates the sampling distribution and is more transparent and controllable.
Core argument: Temperature should be a learnable decision variable rather than a manually tuned hyperparameter.
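
The effect of temperature on the sampling distribution can be illustrated with a small sketch. It assumes the standard definition in which logits are divided by \(T\) before the softmax; the logits below are made up for illustration and the function names are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a more diverse distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.0]  # illustrative next-token logits (not from the paper)
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Raising the temperature flattens the distribution and increases its entropy, while lowering it concentrates mass on the top-scoring token.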

Method

Overall Architecture

TAMPO adopts a hierarchical bilevel loop structure:

Inner loop: Generates rollouts using a selected temperature \(T_s\) and updates the LLM policy \(\pi_\theta\) via GRPO.
Outer loop: Reuses inner-loop rollouts to update the temperature meta-policy \(\pi(T)\) based on trajectory advantage signals.
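
The control flow of the two loops might be sketched as follows. This is only a skeleton under stated assumptions: all four callables are hypothetical placeholders for components the summary describes, and the order of the two updates within a step is an assumption.

```python
def tampo_training_loop(num_steps, sample_temperature, generate_rollouts,
                        update_llm_policy, update_meta_policy):
    """Skeleton of one possible bilevel loop; every callable is a placeholder."""
    sampled = []
    for _ in range(num_steps):
        t_s = sample_temperature()        # draw T_s from the meta-policy
        rollouts = generate_rollouts(t_s)  # inner loop: sample trajectories at T_s
        update_llm_policy(rollouts)        # inner loop: e.g. a GRPO step
        update_meta_policy(rollouts)       # outer loop: reuse the same rollouts
        sampled.append(t_s)
    return sampled
```

The point of the sketch is that `generate_rollouts` is called once per step, so the outer loop adds no sampling cost beyond what the inner loop already pays.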

Key Designs

Each trajectory implicitly encodes its "preferred temperature" — the temperature under which that trajectory is most likely to be generated:

\[T^* = \arg\max_{T_k \in \mathcal{T}} \ell_{T_k}(\tau_i)\]

where \(\ell_T(\tau_i) = \frac{1}{|\tau_i|} \sum_{t=1}^{|\tau_i|} \log \pi_{\theta,T}(o_{i,t} | s_{i,t})\) denotes the average log-likelihood.
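
A minimal sketch of the preferred-temperature computation, assuming \(\pi_{\theta,T}\) is the softmax of the model's logits divided by \(T\) and that per-step logits for the trajectory are available. The function names and the toy logits are hypothetical.

```python
import math

def log_prob_at_temperature(logits, token, temperature):
    """log softmax(logits / temperature)[token], computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return scaled[token] - log_z

def avg_log_likelihood(step_logits, tokens, temperature):
    """Average per-token log-likelihood of one trajectory at a temperature."""
    total = sum(log_prob_at_temperature(logits, tok, temperature)
                for logits, tok in zip(step_logits, tokens))
    return total / len(tokens)

def preferred_temperature(step_logits, tokens, candidates):
    """Candidate temperature under which the trajectory is most likely."""
    return max(candidates,
               key=lambda t: avg_log_likelihood(step_logits, tokens, t))

candidates = [0.5, 1.0, 1.5]
step_logits = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]  # toy logits, two decoding steps
greedy_like = [0, 0]   # always picks the highest-logit token
exploratory = [1, 2]   # always picks a low-logit token
print(preferred_temperature(step_logits, greedy_like, candidates))
print(preferred_temperature(step_logits, exploratory, candidates))
```

In this toy case a trajectory made of top-ranked tokens prefers the lowest candidate, and one made of low-ranked tokens prefers the highest; trajectories that mix both can prefer an intermediate value.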

Temperature-Specific Advantage

For each trajectory \(\tau_i\) and virtual candidate temperature \(T_k\):

Compute \(\ell_{T_k}(\tau_i)\): the average log-likelihood of the trajectory under temperature \(T_k\).
Apply sparsemax normalization to obtain \(\hat{\ell}_{T_k}(\tau_i)\) (summing to 1 across \(K\) candidate temperatures).
Temperature-specific advantage: \(\mathcal{A}_i^{(T_k)} = \hat{\ell}_{T_k}(\tau_i) \cdot A_i\).

Intuition:
- Trajectories with positive advantage → reinforce their most likely generating temperature.
- Trajectories with negative advantage → penalize their most likely generating temperature.
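
The three steps above can be sketched in a few lines. The sparsemax here follows the standard sort-and-threshold projection onto the probability simplex; feeding it the raw average log-likelihoods, with no rescaling, is an assumption, and the numbers are invented for illustration.

```python
def sparsemax(scores):
    """Euclidean projection of scores onto the probability simplex."""
    z_sorted = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z in enumerate(z_sorted, start=1):
        cumsum += z
        if 1.0 + k * z > cumsum:  # z is still inside the support
            tau = (cumsum - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]

def temperature_specific_advantages(avg_logliks, advantage):
    """Split one trajectory's advantage A_i across K candidate temperatures."""
    weights = sparsemax(avg_logliks)
    return [w * advantage for w in weights]

# Hypothetical average log-likelihoods of one trajectory at K = 3 candidates.
avg_logliks = [-0.2, -0.5, -1.6]
print(temperature_specific_advantages(avg_logliks, advantage=2.0))
print(temperature_specific_advantages(avg_logliks, advantage=-1.0))
```

For these inputs the sparsemax weights are 0.65, 0.35, and 0, so the least likely candidate receives no credit or blame at all, which is the practical difference from a softmax weighting.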

Meta-Policy Update

Batch aggregation: \(\mathcal{A}_\mathcal{B}^{(T_k)} = \frac{1}{|\mathcal{B}|G} \sum_b \sum_i \mathcal{A}_{b,i}^{(T_k)}\)
EMA smoothing: \(\bar{\mathcal{A}}_s^{(T_k)} = (1-\alpha)\bar{\mathcal{A}}_{s-1}^{(T_k)} + \alpha \mathcal{A}_\mathcal{B}^{(T_k)}\)
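Min-max normalization to obtain a probability distribution: \(\pi_s(T_k) = \frac{\tilde{\mathcal{A}}_s^{(T_k)}}{\sum_j \tilde{\mathcal{A}}_s^{(T_j)}}\), where \(\tilde{\mathcal{A}}_s^{(T_k)}\) denotes \(\bar{\mathcal{A}}_s^{(T_k)}\) after min-max normalization across the \(K\) candidate temperatures.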
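
A sketch of the update, assuming the usual min-max formula \((x - \min)/(\max - \min)\) and a uniform fallback when all candidates tie. Both choices, the function names, and the example numbers are assumptions rather than details confirmed by the paper.

```python
def batch_mean(per_trajectory_advantages):
    """Average K-dim temperature-specific advantages over all |B|*G trajectories."""
    n = len(per_trajectory_advantages)
    k = len(per_trajectory_advantages[0])
    return [sum(row[j] for row in per_trajectory_advantages) / n
            for j in range(k)]

def ema_update(previous, batch, alpha):
    """Exponential moving average with coefficient alpha on the new batch."""
    return [(1.0 - alpha) * p + alpha * b for p, b in zip(previous, batch)]

def meta_policy(ema, eps=1e-8):
    """Min-max normalize the EMA advantages, then renormalize to sum to 1."""
    lo, hi = min(ema), max(ema)
    if hi - lo < eps:  # degenerate case: fall back to uniform (assumption)
        return [1.0 / len(ema)] * len(ema)
    tilde = [(a - lo) / (hi - lo) for a in ema]
    total = sum(tilde)
    return [t / total for t in tilde]

# Two hypothetical trajectories, K = 3 candidate temperatures.
batch = batch_mean([[0.3, 0.9, -0.1], [0.1, 0.3, -0.3]])
ema = ema_update([0.0, 0.0, 0.0], batch, alpha=0.05)
print(meta_policy(ema))
```

With plain min-max normalization the lowest-scoring candidate always receives probability zero at that step; whether the paper smooths this is not stated here.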

Temperature Sampling

Temperature is sampled from the meta-policy via nucleus sampling (top-p), with \(p=0.7\) providing the best exploration–exploitation balance.
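
Top-p sampling over a discrete set of candidate temperatures might look like the sketch below. The candidate grid and probabilities are hypothetical, and treating \(p=0\) as "keep only the most probable candidate" is an assumption chosen to match the greedy setting in the ablation.

```python
import random

def nucleus(probs, p):
    """Smallest set of top-probability indices whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Renormalize the kept probabilities so they sum to 1.
    return kept, [probs[i] / mass for i in kept]

def sample_temperature(candidates, probs, p=0.7, rng=random):
    """Draw one temperature from the top-p nucleus of the meta-policy."""
    kept, weights = nucleus(probs, p)
    index = rng.choices(kept, weights=weights, k=1)[0]
    return candidates[index]

candidates = [0.7, 1.0, 1.3, 1.6]   # hypothetical candidate grid
probs = [0.1, 0.5, 0.3, 0.1]        # hypothetical meta-policy
print(sample_temperature(candidates, probs, p=0.7, rng=random.Random(0)))
```

Here the nucleus at \(p=0.7\) contains only the two most probable candidates, so low-probability temperatures are never drawn while some randomness among the leading candidates remains.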

Design Characteristics

Zero additional rollouts: Fully reuses trajectory data from the inner loop.
Non-differentiable optimization: The sampling temperature cannot be optimized by gradient descent in LLM RL; TAMPO circumvents this by using trajectory likelihoods under each candidate temperature as the learning signal.
Negligible overhead: The meta-policy maintains only a list of temperature advantages and is discarded at inference time.

Key Experimental Results

Main Results: Mathematical Reasoning Benchmarks (DS-Qwen-1.5B)

Method	Average	AIME24	MATH-500	AMC23	Minerva	OlympiadBench
DS-Qwen-1.5B (no RL)	39.1	13.3	76.2	45.0	22.8	38.4
GRPO (\(T_s = 0.9\))	42.0	20.0	75.2	50.0	26.1	38.7
GRPO (\(T_s = 1.5\))	42.6	23.3	75.4	52.5	22.8	39.0
GRPO (\(T_s\): 0.9 → 1.5)	42.8	16.7	76.6	55.0	24.6	41.0
TAMPO	44.5	23.3	76.8	55.0	27.9	39.6
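
As a quick arithmetic check, the Average column appears to be the unweighted mean of the five benchmark scores. The sketch below recomputes it from the table rows; the dictionary keys are just labels for those rows.

```python
# Per-benchmark scores copied from the table above, in column order:
# AIME24, MATH-500, AMC23, Minerva, OlympiadBench.
scores = {
    "no RL": [13.3, 76.2, 45.0, 22.8, 38.4],
    "GRPO T=0.9": [20.0, 75.2, 50.0, 26.1, 38.7],
    "GRPO T=1.5": [23.3, 75.4, 52.5, 22.8, 39.0],
    "GRPO 0.9->1.5": [16.7, 76.6, 55.0, 24.6, 41.0],
    "TAMPO": [23.3, 76.8, 55.0, 27.9, 39.6],
}

averages = {name: round(sum(v) / len(v), 1) for name, v in scores.items()}

# Gap between TAMPO and the better of the two fixed-temperature runs.
best_fixed = max(averages["GRPO T=0.9"], averages["GRPO T=1.5"])
gain_over_fixed = round(averages["TAMPO"] - best_fixed, 1)

print(averages)
print(gain_over_fixed)
```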

Ablation Study: EMA Coefficient \(\alpha\)

\(\alpha\)	Average	AIME24	MATH-500	AMC23	Minerva	OlympiadBench
0.01	41.6	20.0	75.2	50.0	25.4	37.5
0.05	44.5	23.3	76.8	55.0	27.9	39.6
0.10	43.6	23.3	75.4	57.5	23.2	38.8

Ablation Study: Meta-Policy Sampling Strategy

top-p	Average
0.9	43.0
0.7	44.5
0.5	42.2
0 (greedy)	40.9

Cross-Task Generalization (Qwen2.5-3B-Instruct → ECQA)

Method	Pass@1	Pass@8
No RL	73.06%	77.76%
GRPO	75.07%	78.94%
TAMPO	76.12%	79.67%

Key Findings

TAMPO outperforms the best fixed-temperature baseline by 1.9 points on Pass@1 (44.5 vs. 42.6) and 1.7 points on Pass@8, averaged over the five benchmarks.
The learned temperature dynamics favor higher temperatures (~1.3) after warmup to encourage exploration, gradually decreasing as training proceeds.
Greedy sampling (\(p=0\)) yields the worst results, indicating that exploration of the temperature space itself requires stochasticity.
Training time is identical to the baseline (~9h54min on 8×V100).
The method generalizes effectively to commonsense reasoning tasks.

Highlights & Insights

Elevating temperature from a hyperparameter to a decision variable: a novel problem formulation.
No additional rollouts required: Virtual temperature likelihood computation cleverly reuses existing trajectory data.
Learned temperature policy aligns with intuition: A high-to-low exploration–exploitation transition emerges naturally.
Fully compatible with existing RL methods: Can be plugged into GRPO, DAPO, REINFORCE++, and other critic-free approaches.
Negligible computational overhead: Only \(K\) temperature advantage estimates need to be maintained.

Limitations & Future Work

The candidate temperature set \(\mathcal{T}\) still requires manual specification of range and granularity.
The unimodal property of trajectory likelihood with respect to temperature may not hold in all settings.
Main experiments are conducted only on 1.5B models; validation on larger models is insufficient.
The temperature meta-policy is shared across prompts; prompt-level adaptation remains unexplored.

Related Work

Critic-free RL: GRPO, DAPO, REINFORCE++
Exploration–exploitation: ε-greedy, temperature annealing, UCB, entropy regularization
Meta-policy: MLSH (hierarchical RL), Meta-SAC (automatic entropy coefficient)

Rating

Novelty: ⭐⭐⭐⭐⭐ — The formulation of temperature as a meta-policy is original.
Technical Depth: ⭐⭐⭐⭐ — Theoretical derivations are clear, though the method itself is concise.
Experimental Thoroughness: ⭐⭐⭐⭐ — Five benchmarks with comprehensive ablations, but limited model scale.
Practicality: ⭐⭐⭐⭐⭐ — Improves RL training with zero additional computational cost.