ACL 2025 LLM (Other) Length Control Black-box LLMs Metropolis-Hastings Sampling Importance Sampling Iterative Inference Tuning-Free

Length Controlled Generation for Black-box LLMs¶

Conference: ACL 2025
arXiv: 2412.14656
Code: Not released
Authors: Yuxuan Gu, Wenjie Wang, Xiaocheng Feng, Weihong Zhong, Kun Zhu, Lei Huang, Tat-Seng Chua, Bing Qin
Institution: Harbin Institute of Technology, National University of Singapore, Peng Cheng Laboratory
Area: LLM/NLP / Controllable Text Generation
Keywords: Length Control, Black-box LLMs, Metropolis-Hastings Sampling, Importance Sampling, Iterative Inference, Tuning-Free

TL;DR¶

An iterative sampling framework based on the Metropolis-Hastings algorithm, integrated with an importance sampling acceleration strategy, is proposed to achieve precise length control for black-box LLMs without modifying model parameters. It achieves a 100% length control success rate on Llama3.1 in at most 5 iterations, without compromising generation quality.

Background & Motivation¶

Background: LLMs excel at instruction following, but precisely controlling the length of output text remains a difficult challenge. Subword tokenization and autoregressive decoding make it hard for models to accurately perceive and control the number of words.

Importance of Length Control:
- Summary generation requires specific lengths to balance informativeness and conciseness.
- Length bias introduced by preference alignment (RLHF/DPO) causes models to favor longer responses, affecting evaluation fairness (Singhal 2023).
- In practical applications, users often need to specify response lengths (e.g., "summarize within 100 words").

Limitations of Prior Work:
- Fine-tuning-based methods (Yuan et al. 2024; Wang et al. 2024) require modifying model parameters, which is computationally expensive and may damage general capabilities.
- Reinforcement learning methods (Stiennon et al. 2020) also require training.
- These methods cannot be applied to black-box API models (such as GPT-4).

Core Motivation: To design an inference-stage length control method that treats the LLM as an unmodifiable black-box component, activating its intrinsic length-following capability.

Method¶

Overall Architecture: Metropolis-Hastings Iterative Sampling¶

Length-controlled generation is modeled as a sampling problem from a target distribution \(\pi(y|x) \propto f(y)P(y|x)\), where:
- \(P(y|x)\): The text generation probability distribution of the LLM.
- \(f(y)\): The length constraint scoring function.

Goal: To find text that simultaneously satisfies length constraints and exhibits high generation quality.

Since direct sampling from \(\pi(y|x)\) is infeasible (the partition function is intractable), the Metropolis-Hastings algorithm from MCMC is utilized for iterative approximation.
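
For reference, the standard Metropolis-Hastings acceptance probability shows why the intractable partition function never has to be computed. The block below is the textbook identity rather than a formula quoted from the paper; \(Z(x)\) is a symbol introduced here for the normalizer (assumed finite), and \(p\) denotes the proposal distribution.

\[
\begin{aligned}
\pi(y \mid x) &= \frac{f(y)\,P(y \mid x)}{Z(x)}, \qquad Z(x) = \sum_{y'} f(y')\,P(y' \mid x), \\
\mathcal{A}(y_{i-1} \to y_i) &= \min\!\left(1,\; \frac{\pi(y_i \mid x)\,p(y_{i-1} \mid y_i, x)}{\pi(y_{i-1} \mid x)\,p(y_i \mid y_{i-1}, x)}\right) \\
&= \min\!\left(1,\; \frac{f(y_i)\,P(y_i \mid x)\,p(y_{i-1} \mid y_i, x)}{f(y_{i-1})\,P(y_{i-1} \mid x)\,p(y_i \mid y_{i-1}, x)}\right).
\end{aligned}
\]

\(Z(x)\) appears in both the numerator and the denominator and cancels, so only ratios of \(f\), \(P\), and \(p\) are needed. When the proposal is symmetric, the two \(p\) terms also cancel.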

Key Design 1: Length Constraint Score \(f(y)\)¶

An NLTK word tokenizer is used to calculate the word count \(\text{Len}(y)\).
Precise length target: \(f(y) = 1 / |\text{Len}(y) - \ell|\), where smaller deviations yield higher scores.
Range length target: Within the interval \([\ell_1, \ell_2]\), \(f(y) = +\infty\) (immediate acceptance), and decreases with distance outside the interval.
Analogy to the Lagrangian method: \(\log f(y)\) acts as the constraint, and \(\log P(y|x)\) serves as the regularization objective.
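
The two scoring rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: `word_count` uses a simple regex as a stand-in for the NLTK word tokenizer (NLTK splits contractions and punctuation differently), the exact-match case is mapped to `+inf` to avoid division by zero, and the `1 / distance` decay outside the range is an assumed form, since the note only says the score decreases with distance.

```python
import math
import re


def word_count(text: str) -> int:
    """Count words with a simple regex (stand-in for the NLTK word tokenizer)."""
    return len(re.findall(r"\w+(?:'\w+)?", text))


def exact_length_score(text: str, target: int) -> float:
    """f(y) = 1 / |Len(y) - target|; +inf when the target is hit exactly (assumption)."""
    deviation = abs(word_count(text) - target)
    return math.inf if deviation == 0 else 1.0 / deviation


def range_length_score(text: str, low: int, high: int) -> float:
    """+inf inside [low, high]; outside, 1 / distance to the nearest bound (assumed decay)."""
    n = word_count(text)
    if low <= n <= high:
        return math.inf
    distance = low - n if n < low else n - high
    return 1.0 / distance
```

For example, a three-word text scored against a target of five words has a deviation of 2 and therefore a score of 0.5.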

Key Design 2: LLM Probability Estimation (LLM-as-Judge)¶

Since \(P(y|x)\) cannot be directly obtained for black-box LLMs, a dual strategy is employed:
- Absolute Scoring \(\phi(y|x)\): Prompting the LLM to score the generated text across multiple predefined dimensions.
  - Summarization task: information coverage, fluency, conciseness, logical coherence, faithfulness.
  - Instruction following: usefulness, relevance, accuracy, depth, creativity, detail level.
- Pairwise Comparison \(\Phi(y_i, y_{i-1}|x)\): Directly comparing two candidates from adjacent iteration steps to reduce score variance.

Key Design 3: Proposal Distribution and Importance Sampling Acceleration¶

Basic Proposal Distribution \(p(y_i|y_{i-1}, x)\):
- A symmetric constraint \(p(y_i|y_{i-1}) = p(y_{i-1}|y_i)\) is imposed to simplify the calculation of the acceptance rate.
- A "time-unbiased" prompt template is used: asking the LLM to generate a new variant by referring to the previous output.

Importance Sampling Acceleration \(q(y_i|y_{i-1}, x)\):
- Problem: The basic proposal distribution carries no updated length signal, so the LLM can get stuck repeating its own errors.
- Solution: An importance distribution that incorporates the length constraint replaces the proposal distribution.
  - Length guidance such as "please rewrite within approximately \(\ell\) words" is added to the prompt.
- With this replacement, the acceptance rate is bounded above by the original acceptance rate: \(\mathcal{A} \leq \min\left(1, \frac{f(y_i)P(y_i|x)}{f(y_{i-1})P(y_{i-1}|x)}\right)\).
- Although this could in theory increase the false acceptance rate, the strong generation capabilities of LLMs make the risk negligible in practice.

Algorithm Flow:

Initialization: \(y_0 \sim P(y|x)\) (LLM original output)
Iteration (up to \(n\) times):
- Generate candidate \(y_i\) via the importance distribution.
- Calculate acceptance rate \(\mathcal{A}(y_{i-1} \to y_i)\).
- Accept \(y_i\) with probability \(\mathcal{A}\); otherwise, retain \(y_{i-1}\).
Parallel sampling (beam search-style) is supported to enhance efficiency.
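
The algorithm flow above can be sketched as a short loop. This is a minimal single-chain sketch, not the authors' implementation: `generate`, `propose`, `quality_ratio`, and `length_score` are hypothetical callables standing in for the initial LLM call, the length-guided rewrite prompt, the LLM-as-judge estimate of \(P(y_i|x) / P(y_{i-1}|x)\), and \(f(y)\) respectively. Stopping early once the constraint is met and accepting immediately on an infinite score are assumptions made here; parallel (beam-style) sampling is omitted.

```python
import math
import random
from typing import Callable, Optional


def mh_length_control(
    x: str,
    generate: Callable[[str], str],
    propose: Callable[[str, str], str],
    quality_ratio: Callable[[str, str, str], float],
    length_score: Callable[[str], float],
    max_iters: int = 5,
    rng: Optional[random.Random] = None,
) -> str:
    """Run up to max_iters Metropolis-Hastings steps and return the final text."""
    rng = rng or random.Random(0)
    current = generate(x)  # y_0: the LLM's unconstrained output
    for _ in range(max_iters):
        f_old = length_score(current)
        if math.isinf(f_old):
            break  # constraint already satisfied (assumed early stop)
        candidate = propose(x, current)  # draw y_i from the importance distribution
        f_new = length_score(candidate)
        if math.isinf(f_new):
            accept_prob = 1.0  # immediate acceptance when the constraint is met
        else:
            ratio = (f_new / f_old) * quality_ratio(x, candidate, current)
            accept_prob = min(1.0, ratio)
        if rng.random() < accept_prob:
            current = candidate  # accept y_i; otherwise y_{i-1} is retained
    return current
```

In a real setting each callable would wrap one or more API calls, so the number of LLM requests per iteration is at least two (one proposal, one judgment).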

Key Experimental Results¶

Main Results: Precise Length Control (CNN/DailyMail Summary)¶

Model	Method	Accuracy↑	L1 Error↓	L2 Error↓	Rouge-1
Llama2	Inst	4.1%	11.42	15.20	0.37
Llama2	Ours	81.6%	0.24	0.64	0.36
Llama3.1	Inst	7.7%	3.88	5.10	0.38
Llama3.1	Ours	100.0%	0.00	0.00	0.38
GPT-4	Inst	15.7%	2.10	2.67	0.36
GPT-4	Ours	99.2%	0.01	0.12	0.36

Llama3.1 achieves 100% precise length control, with both L1 and L2 errors being 0.
Rouge metrics remain almost unchanged, verifying that length control does not compromise generation quality.
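
This note does not define the length metrics in the table. The sketch below shows one plausible reading, given as an assumption rather than the paper's definition: accuracy as the fraction of outputs whose word count equals the target, L1 error as the mean absolute deviation, and L2 error as the root mean squared deviation. The function name `length_metrics` is hypothetical.

```python
import math
from typing import Dict, Sequence


def length_metrics(lengths: Sequence[int], targets: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, L1 error, and L2 error under the assumed definitions."""
    if len(lengths) != len(targets) or not lengths:
        raise ValueError("lengths and targets must be non-empty and equally long")
    deviations = [length - target for length, target in zip(lengths, targets)]
    n = len(deviations)
    return {
        "accuracy": sum(d == 0 for d in deviations) / n,  # exact-match rate
        "l1_error": sum(abs(d) for d in deviations) / n,  # mean absolute deviation
        "l2_error": math.sqrt(sum(d * d for d in deviations) / n),  # root mean square
    }
```

Under this reading, an L2 error that is larger than the L1 error (as in the Llama2 row) would indicate that the remaining misses are concentrated in a few outputs with larger deviations.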

Main Results: Range Length Control (Alpaca-Eval-LI / MT-Bench-LI)¶

Dataset	Model	Inst Accuracy	Ours Accuracy	Win Rate Gain
Alpaca	GPT-4	37.2%	99.2%	30.2%→92.0%
Alpaca	Llama3	92.2%	99.8%	76.5%→83.5%
MT-Bench	GPT-4	54.7%	98.8%	27.4%→63.7%

GPT-4's Win Rate on Alpaca-Eval-LI improves from 30.2% to 92.0%.

Iteration Count Analysis (Llama3.1, CnnDM)¶

Iterations	Accuracy
0	7.7%
1	86.4%
2	99.2%
4	100.0%

A single iteration already yields a substantial improvement (7.7% to 86.4%), and accuracy reaches 100% by iteration 4.

Ablation Study¶

MH (without importance sampling): 40.2% accuracy \(\to\) MH+IS (with importance sampling): 93.3%.
Importance sampling is a key factor for performance improvement.
As the beam size increases from 1 to 16, accuracy rises from 24.6% to 86.4% (Qwen2.5).
Parallel sampling effectively improves efficiency.

Highlights & Insights¶

Combining Classic and Modern: Applying the classic Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970) to modern LLM length control, with an elegant theoretical framework.
True Black-box Method: No access to model parameters or probability outputs is required; precise length control is achieved solely through API calls.
Almost Zero Degradation: ROUGE scores are almost identical before and after control, indicating that the method does not sacrifice content quality as measured by ROUGE.
High Efficiency: On Llama3.1, a 100% success rate is achieved within at most 5 iterations, which keeps the computational overhead manageable.
Broad Applicability: Effective across both open-source (Llama series, Qwen) and closed-source (GPT-3.5/4) models.

Limitations & Future Work¶

API Call Cost: Each iteration requires multiple LLM calls (generation + scoring), leading to high costs under API-billing models.
Scoring Accuracy: Ratings from LLM-as-Judge can be inaccurate, especially for complex tasks.
Symmetry Assumption: The symmetry constraint of the proposal distribution is only approximately satisfied in practice.
Long Text Scenarios: The paper primarily evaluates performance on short texts (summarization); the effectiveness of controlling long texts (1000+ words) remains unknown.
Non-length Constraints: Although the framework can theoretically extend to other constraints (e.g., specific topics or formats), this paper only validates length constraints.

Related Work¶

LLM Instruction Following: Studies such as Ouyang et al. 2022 (InstructGPT) and Zhou et al. 2024 enhance instruction-following capabilities.
Length Control: Early methods used special tokens to mark length (Fan et al. 2017) or convolutional block length factors (Liu et al. 2018); recent methods encode length signals into positional encodings, attention units, or natural language instructions (Yuan et al. 2024).
MCMC Applications in NLP: Existing work has applied MCMC to language generation (though not specifically for length control).

Rating¶

⭐⭐⭐⭐⭐ (5/5)

Novelty: ⭐⭐⭐⭐⭐ Applying the MH algorithm to LLM length control, featuring a clear and novel theoretical framework.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 6 models, 3 datasets, and detailed ablations; the evaluation is extremely thorough.
Writing Quality: ⭐⭐⭐⭐⭐ Rigorous mathematical derivations, with experimental logic progressing step-by-step.
Value: ⭐⭐⭐⭐⭐ Directly applicable to length control scenarios for any black-box LLM API.