Brevity is the soul of sustainability: Characterizing LLM response lengths¶

Conference: ACL 2025
arXiv: 2506.08686
Code: https://github.com/sohampoddar26/LLM-brevity
Area: LLM / Efficiency and Sustainability
Keywords: Response length, energy efficiency, prompt engineering, output compression, Green AI

TL;DR¶

This work systematically studies the response-length behavior of 12 LLMs across 5 datasets. It finds that LLMs routinely produce far longer responses than needed: the core answer accounts for only about 42% of a response on average. It proposes several prompting strategies that reduce response length by 25-88% and inference energy consumption by 25-60%, while maintaining or even improving ROUGE-L F1.

Background & Motivation¶

Background: LLM inference consumes substantial energy (e.g., ChatGPT processes over 1 billion queries daily, consuming an estimated 0.5 GWh per day). Inference optimization research has primarily focused on reducing the computation cost per token, for example through model compression, quantization, distillation, and speculative decoding. Output compression (reducing the number of generated tokens) has not been studied systematically.

Limitations of Prior Work:
- LLMs widely exhibit a "verbose bias", generating responses far longer than necessary. The authors' previous work showed that inference energy consumption is strongly and positively correlated with output length, because output tokens are generated sequentially, whereas the input can be processed in parallel and cached.
- Simple fine-tuning experiments (LoRA, \(r=16\), 100 epochs) actually increase response length slightly (1.24x), suggesting that LLM verbosity is deeply rooted in pre-training.
- Truncation strategies are also limited: in about 19% of the responses, the core answer is not at the beginning.

Key Challenge: Shorter responses = fewer tokens = lower energy consumption. However, how can responses be shortened without sacrificing quality? Additional information (explanations, examples, politeness) may enhance user experience but incurs extra energy cost—this is a trade-off that has not been previously quantified.

Goal: (1) How much redundant text do LLMs actually generate, and what does this redundant content consist of? (2) Can simple prompting strategies effectively control the output length? (3) What is the specific impact of length reduction on both quality and energy consumption?

Key Insight: An analogy of "Economy vs. Business Class"—enhanced experiences beyond the core demand (arriving on time / obtaining the answer) come with additional costs. This work is the first to systematically quantify this trade-off.

Core Idea: "Brevity is the soul of sustainability"—using prompt engineering to control output length is the simplest and zero-cost approach to saving energy during inference.

Method¶

Overall Architecture¶

Three stages: (1) Benchmark response length vs. target length across 12 LLMs × 5 datasets; (2) Annotate information categories of LLM responses, defining 6 categories and analyzing their distribution; (3) Design various prompting strategies to compress output and evaluate the impact on length, quality, and energy consumption.

Key Designs¶

Response Length Benchmark (12 LLMs × 5 Datasets):
- Function: Compare the length generated by LLMs against the target answer length.
- Mechanism: Five factual QA datasets (Dolly, GooAQ, MS-MaRCo, NarrativeQA, TweetQA) are selected, covering diverse task types, domains, and answer formats.
- Key Findings: The ratio of LLM response length to target length can be categorized into three tiers—moderate (1-3x, GPT-3.5), long (3-10x, GPT-4, Gemma, LLaMA-2, Mistral), and extremely long (>10x, LLaMA-3, Phi-3). For reasoning models (DeepSeek-R1), thinking/reasoning tokens account for 64-74% of the responses.
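
The tiering above can be sketched in a few lines. This is only an illustration of the ratio and the three ranges named in the summary; the function names and the handling of the exact boundaries (3x and 10x) and of ratios below 1 are assumptions, not taken from the paper.

```python
def length_ratio(response_tokens: int, target_tokens: int) -> float:
    """Ratio of generated length to target (reference) length."""
    if target_tokens <= 0:
        raise ValueError("target_tokens must be positive")
    return response_tokens / target_tokens


def verbosity_tier(ratio: float) -> str:
    """Map a length ratio onto the tiers 1-3x, 3-10x and >10x.

    Assumption: ratios below 3 (including those below 1) count as
    'moderate', and the boundaries 3 and 10 fall into 'long'.
    """
    if ratio < 3:
        return "moderate"
    if ratio <= 10:
        return "long"
    return "extremely long"
```

For instance, a 50-token response to a question whose target answer has 10 tokens has a ratio of 5 and falls into the "long" tier under these assumptions.
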
6-Category Information Annotation Framework:
- Function: Define and annotate 6 information categories in LLM responses.
- Mechanism: MinAns (minimal/core answer), AddInfo (additional info), Explain (reasoning/explanation), Convers (conversational markers/politeness), RedInfo (redundancy/repetition), and Irrel (irrelevant/hallucination).
- Key Findings: The core answer averages only 42% of the content, while irrelevant information accounts for around 18%, and conversational filler takes up roughly 5.2%—reducing the latter categories yields immediate energy savings. Inter-annotator agreement (F-measure) is 0.764.
- Design Motivation: It is necessary to identify "what" the redundant text contains in order to design targeted compression strategies.
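
Given annotated responses, per-category proportions like the ones reported above could be computed roughly as follows. This is a minimal sketch: the `(category, text)` input format and the use of whitespace-separated words as the unit of measurement are assumptions made for illustration, and the paper's annotation tooling may differ.

```python
from collections import Counter

# The six category labels named in the summary.
CATEGORIES = ("MinAns", "AddInfo", "Explain", "Convers", "RedInfo", "Irrel")


def category_shares(segments):
    """Return the share of words that falls into each category.

    segments: list of (category, text) pairs for one or more responses
    (a hypothetical input format chosen for this sketch).
    """
    counts = Counter({c: 0 for c in CATEGORIES})
    for category, text in segments:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        counts[category] += len(text.split())  # whitespace word count
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in CATEGORIES}
    return {c: counts[c] / total for c in CATEGORIES}


example = [
    ("Convers", "Sure, happy to help!"),
    ("MinAns", "Paris."),
    ("AddInfo", "It is also the largest city in France."),
]
shares = category_shares(example)
```

In this toy example the core answer is a single word out of thirteen, which is the kind of imbalance the annotation framework is meant to expose.
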
6 Prompt Engineering Strategies:
- BRIEF: Appends "Answer briefly" to the end of the query.
- BM25-InContext: Retrieves 10 similar Q&A pairs using BM25+ as in-context examples to guide the LLM to learn appropriate response lengths.
- Limit-Len (3 variants): Specifies "Answer within X words"—where X is derived from the BM25 retrieved median length (BM25-length), the ground-truth answer length (GoldResLen, oracle), or a trained length predictor (PredResLen, a DeBERTa-v3-large regression model).
- Limit-Cat (2 variants): MinAns (provide only the core answer), and MAddNoRed (core answer + additional informative content, avoiding redundancy and politeness).
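
The prompt variants listed above might be assembled as in the sketch below. The phrases "Answer briefly", "Answer within X words" and "Only provide the minimal answer" come from this summary; the MAddNoRed wording, all function names, and the choice to round the median up are hypothetical. BM25 retrieval itself is out of scope here, so the retrieved answers are passed in as a plain list.

```python
import math
from statistics import median


def brief(query: str) -> str:
    """BRIEF: append a short instruction to the query."""
    return f"{query} Answer briefly."


def limit_len(query: str, max_words: int) -> str:
    """Limit-Len: state an explicit word budget."""
    return f"{query} Answer within {max_words} words."


def bm25_length(query: str, retrieved_answers: list[str]) -> str:
    """BM25-length: budget = median word count of retrieved answers.

    Assumption: the answers were already retrieved by some BM25 index;
    the median is rounded up and floored at 1 word.
    """
    if not retrieved_answers:
        raise ValueError("need at least one retrieved answer")
    lengths = [len(a.split()) for a in retrieved_answers]
    return limit_len(query, max(1, math.ceil(median(lengths))))


def min_ans(query: str) -> str:
    """Limit-Cat / MinAns: ask for the core answer only."""
    return f"{query} Only provide the minimal answer."


def m_add_no_red(query: str) -> str:
    """Limit-Cat / MAddNoRed: hypothetical wording, not the paper's prompt."""
    return (
        f"{query} Provide the minimal answer plus useful additional "
        "information. Avoid repetition and conversational phrases."
    )
```

GoldResLen and PredResLen would reuse `limit_len` with a budget taken from the reference answer or from a trained length predictor, respectively.
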

Evaluation Metrics¶

- Response length (token count).
- ROUGE-L F1 (match against the target answers).
- Inference energy consumption (in mWh, measured with CodeCarbon).
- Shift in the distribution of information categories.
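
To make the quality metric concrete, here is a simplified ROUGE-L F1 based on the longest common subsequence (LCS) of words. It is a sketch only: lowercasing and whitespace tokenization are assumptions, and standard ROUGE packages apply their own tokenization and optional stemming, so scores will not match them exactly.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS precision and LCS recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the response that matches
    recall = lcs / len(ref)      # share of the target that is covered
    return 2 * precision * recall / (precision + recall)
```

With the target answer "paris", the response "paris" scores 1.0, while "the capital of france is paris" has the same recall but a precision of 1/6, giving an F1 of 2/7. This shows how extra text around a correct answer lowers the score.
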

Key Experimental Results¶

Main Results: Strategy Effectiveness Comparison (Averaged Across All Models × Datasets)¶

| Strategy | Length Reduction | Energy Reduction | ROUGE-L F1 Change |
|---|---|---|---|
| MinAns | ~60% | ~28% | Highest improvement |
| PredResLen | ~53% | ~26% | Comparable to oracle |
| GoldResLen (oracle) | ~50% | ~26% | Good |
| BRIEF | ~38% | Moderate | Improved |
| BM25-length | ~38% | Lower (computation overhead) | Slightly improved |
| MAddNoRed | Less | Moderate | Moderate |
| BM25-InContext | Least | Ineffective (due to input overhead) | May drop |

LLM Verbosity Taxonomy¶

| Tier (generated/target length ratio) | Representative Models | Characterization |
|---|---|---|
| Moderate (1-3x) | GPT-3.5 | Relatively concise responses |
| Long (3-10x) | GPT-4, Gemma-2, LLaMA-2, Mistral, Vicuna | Moderately verbose |
| Extremely long (>10x) | LLaMA-3.1, Phi-3 | Highly verbose |

Information Category Distribution¶

| Category | Proportion | Description |
|---|---|---|
| MinAns (Core Answer) | ~42% | Segment that directly answers the question |
| AddInfo (Additional Info) | ~21% | Extra background or context |
| Irrel (Irrelevant Info) | ~18% | Hallucinations or digressions |
| Explain (Reasoning/Explanation) | ~11.5% | Step-by-step thinking demonstration |
| Convers (Conversational) | ~5.2% | Polite phrases like "Let me know if..." |
| RedInfo (Redundancy) | ~2% | Repeating the same information |

Key Findings¶

Simplest strategy is the most effective: MinAns, by simply appending "Only provide the minimal answer", achieves the highest length compression and energy savings. Crucially, the ROUGE-L F1 score improves because noise is reduced, leading to significantly higher precision.
Supervised length prediction matches oracle performance: PredResLen (which predicts the ideal target length with DeBERTa) even outperforms the oracle ground-truth length strategy (GoldResLen) on certain models (66% for Mistral, 77% for LLaMA-2, 69% for Gemma-2).
Verbosity is consistent within model families: Different sizes of the same model family exhibit similar information category distributions, suggesting that pre-training strategies are the primary driver of verbosity.
Newer models are more verbose than older ones: LLaMA-3.1 and GPT-4 generate more explanations and additional info; older models like LLaMA-2 and GPT-3.5 are more inclined to directly emit the core answer.
Response length correlates linearly with energy consumption: every token saved translates directly into lower compute and energy use.
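
The linear relationship can be illustrated with a toy model in which the energy per query is a fixed cost plus a per-token cost. This is a sketch under stated assumptions: the fixed term (for example prompt processing) and all parameter names are hypothetical, and the summary does not report the coefficients of any fitted model.

```python
def energy_saving(fixed_mwh: float, per_token_mwh: float,
                  tokens_before: int, tokens_after: int) -> float:
    """Relative energy saving under an assumed linear model.

    energy = fixed_mwh + per_token_mwh * output_tokens
    Both coefficients are hypothetical inputs, not measured values.
    """
    before = fixed_mwh + per_token_mwh * tokens_before
    after = fixed_mwh + per_token_mwh * tokens_after
    if before <= 0:
        raise ValueError("baseline energy must be positive")
    return 1 - after / before
```

Under this model, the relative energy saving equals the relative length reduction only when the fixed cost is zero; any fixed per-query cost makes the energy saving smaller than the length reduction. That is one possible reading of why the length reductions in the results table are larger than the corresponding energy reductions, though the summary itself does not state this explanation.
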

Highlights & Insights¶

Highly practical findings: Generating energy savings of 25-60% requires no modifications to the model architecture, code, or hardware—only simple prompt modifications. This holds immediate operational value for API providers and enterprise users.
First systematic quantification of "verbose bias": The 6-category information annotation framework transforms the intuitive understanding of LLM verbosity into quantifiable data—making it clear that core answers take up only 42% of the content while nearly 60% is auxiliary generation.
"Economy vs. Business Class" analogy: It contextualizes and structures the trade-off between the cost (energy consumption) and value (user experience) of enhanced information generated by LLMs. This addresses an overlooked perspective in Green AI.
Orthogonality of output and model compression: While model compression scales down the per-token computational cost, output compression reduces the absolute number of generated tokens. The two approaches are orthogonal and fully complementary.

Limitations & Future Work¶

Only tested on factual QA tasks: Long-text tasks such as code generation and creative writing are not suitable for aggressive compression, necessitating task-specific strategies.
ROUGE-L is an imperfect metric for quality: It cannot entirely replace human evaluation of user satisfaction.
Energy measurements depend on hardware and batching: Different GPUs and batch sizes can shift absolute energy utilization metrics.
In-training length control was not explored: Methods such as introducing length penalties or learning brief-response preferences during RLHF were not investigated.
Inconsistent adherence to category control prompts in some LLMs: Phi-3-small and LLaMA-3-8B showed weaker instruction-following capabilities on the MAddNoRed prompt.
Fine-tuning paradoxically increases length: Simple LoRA fine-tuning failed to shorten responses, pointing to the need for deeper investigation into specialized training strategies.

Related Work & Insights¶

vs. Model Compression/Quantization: Pruning, quantization, and distillation reduce the compute cost per token, which is orthogonal and complementary to output compression.
vs. Li et al. (2024): While prior work proposed only a single "Answer briefly" baseline, this study systematically evaluates 6+ strategies and defines a 6-category information taxonomy that supports diagnostic analysis.
Insights: API providers could leverage concise prompt templates (such as MinAns) by default to slash system-wide costs and carbon emissions, reserving elaborated modes strictly for user-opted scenarios.

Rating¶

Novelty: ⭐⭐⭐ The observation of verbose bias is novel and quantifiable, though the technique relies on simple prompt engineering. The 6-category taxonomy is a solid contribution.
Experimental Thoroughness: ⭐⭐⭐⭐ Extensive settings across 12 models, 5 datasets, and 6+ strategies, featuring multi-dimensional analysis on length, quality, energy, and information dynamics.
Writing Quality: ⭐⭐⭐⭐ Clever title, elegant "Economy vs. Business class" analogy, and systematic execution.
Value: ⭐⭐⭐⭐ Highly applicable for Green AI guidelines; the 6-category framework is readily reusable for subsequent diagnostics.