Beyond In-Context Learning: Aligning Long-form Generation of LLMs via Task-Inherent Attribute Guidelines

Conference: ACL 2025
arXiv: 2506.01265
Institution: National University of Singapore, NTU, ASTAR, Salesforce AI Research
Area: LLM NLP / Long-form Generation Alignment
Keywords: In-context Learning, Long-form Generation, LongGuide, Metric Guidelines, Output Constraints

TL;DR

The paper shows, both theoretically and experimentally, that ICL exemplars fail to fully transfer a task's linguistic and formatting attributes. It proposes LongGuide, an algorithm that automatically learns Metric Guidelines (MG) and Output Constraint Guidelines (OCG) from a small amount of training data, and reports an average ROUGE-L improvement of over 5% across 7 long-form generation tasks.

Background & Motivation

Background: In-context learning (ICL) is one of the most vital capabilities of LLMs, calibrating model behavior by providing a few exemplars in the prompt. While ICL is highly effective for classification tasks, its performance is limited in long-form generation tasks (e.g., summarization, translation, dialogue generation). Furthermore, existing theoretical analyses assume that the model perfectly captures the task's language distribution \(P_\mathcal{M}(X) = P_T(X)\), an assumption that does not hold in practice.

Limitations of Prior Work: Even when provided with 5 exemplars that have perfect attribute scores (5/5), only 4%–44% of the outputs generated by ICL models maintain the same attribute scores. Increasing the number of exemplars (from 3 to 5 to 10) fails to resolve this issue. Models cannot implicitly learn linguistic features (e.g., conciseness, informativeness) and formatting features (e.g., sentence count, token count) from exemplars. The authors term this the Property Transfer (PT) problem.

Theoretical Insights: The authors prove that when \(P_\mathcal{M} \neq P_T\), regardless of how many exemplars are provided, ICL cannot recover the true task distribution in the limit. This existence theorem implies that certain attributes demonstrated in the exemplars cannot be reliably transferred to the model's generated outputs.

Key Insight: Since implicit learning is ineffective, explicit textual guidelines can be employed to compensate. Experiments demonstrate that even simple instructions like "The output must maintain {property}" can significantly improve the model's attribute maintenance capability, with formatting attributes (sentence count, token count) showing the most pronounced improvement.
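As a minimal sketch of the explicit-instruction idea, the snippet below appends the quoted template to a task prompt, once per attribute. The function name `add_property_instruction`, the prompt layout, and the example attributes are assumptions made for illustration; only the template wording comes from the text above.

```python
# Template wording taken from the text; everything else is illustrative.
TEMPLATE = "The output must maintain {property}"

def add_property_instruction(task_prompt: str, properties: list[str]) -> str:
    """Append one explicit instruction line per attribute to the task prompt."""
    instructions = [TEMPLATE.format(property=p) for p in properties]
    return task_prompt.rstrip() + "\n" + "\n".join(instructions)

prompt = add_property_instruction(
    "Summarize the following dialogue.",
    ["conciseness", "a length of 2 sentences"],
)
print(prompt)
```

The point of the sketch is only that the attribute is stated in words rather than left for the model to infer from exemplars.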

Core Idea: Automatically learn two types of guidelines from a small amount of training data—Metric Guidelines (MG) to capture linguistic properties, and Output Constraint Guidelines (OCG) to capture formatting properties—to serve as complementary instructions for enhancing long-form generation in LLMs.

Method

Overall Architecture: Five-step Pipeline of LongGuide

LongGuide is an efficient guideline generation algorithm that requires only \(\le 50\) training samples. It generates two types of guidelines in parallel through five steps:

  1. Step 1 — Metric Collection & Selection: From a pool of 27 predefined metrics, CoT prompting is used to let the LLM select the top-5 most important metrics for the current task, repeating for \(K\) rounds to obtain the union.
  2. Step 2 — Metric Self-Scoring: Utilizing LLM + Self-Consistency, the ground-truth answers in the training set are scored on a sample-by-sample basis (scale of 1-5), and averaged to obtain the expected score for each metric.
  3. Step 3 — Metric Guideline (MG) Generation: Convert the metric scores into natural language descriptions (e.g., "Informativeness 4/5" \(\rightarrow\) "should contain a good amount of informative content"), and concatenate them into the MG.
  4. Step 4 — Output Constraint Guideline (OCG) Generation: Use NLTK to calculate the minimum, maximum, and average statistics of the sentence count and token count from the training set outputs, converting them into formatting constraints.
  5. Step 5 — Automated MG-OCG Selection: Compare the ROUGE-L of four configurations (No guideline, MG, OCG, MG+OCG) on the training set, selecting the optimal combination.
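Step 1 of the pipeline above can be sketched as a union over repeated selection rounds. This is a hedged illustration rather than the authors' code: `collect_metrics` and `fake_selector` are hypothetical names, and the selector is a deterministic stand-in for the CoT-prompted LLM call so that the example runs offline.

```python
from typing import Callable, Iterable

def collect_metrics(
    select_top5: Callable[[list[str], int], Iterable[str]],
    metric_pool: list[str],
    rounds: int,
) -> list[str]:
    """Return the union of the metrics picked across `rounds` selection calls."""
    chosen: set[str] = set()
    for round_idx in range(rounds):
        picks = select_top5(metric_pool, round_idx)
        # Keep only names that really are in the pool (LLM output may drift).
        chosen.update(m for m in picks if m in metric_pool)
    return sorted(chosen)

# Stand-in for the LLM: picks a sliding window of five metrics per round.
def fake_selector(pool: list[str], round_idx: int) -> list[str]:
    return pool[round_idx : round_idx + 5]

pool = ["Accuracy", "Brevity", "Clarity", "Coherence", "Conciseness",
        "Fluency", "Informativeness", "Relevance"]
selected = collect_metrics(fake_selector, pool, rounds=3)
print(selected)
```

Taking the union rather than a single round's answer makes the selection less sensitive to sampling noise in any one LLM call.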

Metric Pool Design (Step 1)

The metric pool \(S\) consists of 27 reference-free evaluation metrics from the following sources:

Source | Metrics | Count
--- | --- | ---
ABC's of Communication (Wagner, 1963) | Accuracy, Brevity, Clarity | 3
BARTScore (Yuan et al., 2021) | Relevance, Coherence | 2
GPTScore (Fu et al., 2023) | Semantic Coverage, Factuality, Fluency, Informativeness, Consistency, Engagement, Specificity, Correctness, Understandability, Diversity | 10
Newly added by the authors | Completeness, Conciseness, Neutrality, Naturalness, Readability, Creativity, Rationalness, Truthfulness, Respect of Chronology, Non-repetitiveness, Indicativeness, Resolution | 12

Key Design: LM-based metrics (e.g., FactScore) are deliberately excluded because LLMs struggle to define and self-score them. Metric definitions are also not fixed in advance, since different tasks interpret the same metric differently.

Metric Guideline Generation Mechanism (Step 2-3)

The core mechanism of MG is to let the LLM serve as both the judge and the candidate:

  • Self-Scoring: For each training sample, the LLM uses Self-Consistency to score the ground-truth answers on all selected metrics from 1 to 5, taking the average. This step is separated from Step 1 to ensure independence between the evaluation data and the metric-selection data.
  • Natural Language Conversion: Numeric scores are converted into natural language descriptions, as LLMs understand contextual descriptions better than numeric scores. For instance, an Informativeness score of 4/5 is described as "good amount of informative content".
  • Concatenation into MG: All metric descriptions are concatenated in alphabetical order of metric name to form the complete MG.
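The score-to-text conversion and concatenation described above might look like the following sketch. The wording in `LEVEL_WORDS` and the sentence pattern are assumptions: the source only gives the single 4/5 example ("good amount of informative content"), so the other levels are invented for illustration, and the per-sample scores are made up.

```python
from statistics import mean

# Hypothetical wording for each rounded score level (only "good" for 4/5
# is suggested by the source).
LEVEL_WORDS = {1: "very low", 2: "low", 3: "moderate", 4: "good", 5: "excellent"}

def build_metric_guideline(scores: dict[str, list[float]]) -> str:
    """Average per-sample scores, verbalize them, join in alphabetical order."""
    lines = []
    for metric in sorted(scores):
        avg = mean(scores[metric])
        level = min(5, max(1, round(avg)))  # clamp to the 1-5 scale
        lines.append(f"{metric} should be {LEVEL_WORDS[level]}.")
    return " ".join(lines)

# Made-up self-scores for four training answers.
mg = build_metric_guideline({
    "Informativeness": [4, 4, 5, 4],
    "Conciseness": [5, 5, 4, 5],
})
print(mg)
```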

Output Constraint Guideline (Step 4)

OCG focuses on 6 formatting statistics: the min, max, and avg of both sentence and token counts of ground-truth answers. The output template is: "The response must have from {min_s} to {max_s} sentences and from {min_t} to {max_t} words with an average of {avg_t} words and {avg_s} sentences."
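A minimal sketch of how the six statistics could fill the template is shown below. The source computes the counts with NLTK; to keep the example dependency-free, this sketch substitutes a crude regex sentence splitter and whitespace tokenization, so its counts can differ from NLTK's. The function names, the rounding of averages, and the sample answers are assumptions.

```python
import re
from statistics import mean

OCG_TEMPLATE = (
    "The response must have from {min_s} to {max_s} sentences and from "
    "{min_t} to {max_t} words with an average of {avg_t} words and "
    "{avg_s} sentences."
)

def count_sentences(text: str) -> int:
    # Crude split after ., ! or ?; an assumption standing in for NLTK.
    return len([s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s])

def count_tokens(text: str) -> int:
    # Whitespace tokens; NLTK's word tokenizer would also split punctuation.
    return len(text.split())

def build_ocg(answers: list[str]) -> str:
    """Fill the OCG template with min/max/avg sentence and token counts."""
    sents = [count_sentences(a) for a in answers]
    toks = [count_tokens(a) for a in answers]
    return OCG_TEMPLATE.format(
        min_s=min(sents), max_s=max(sents), avg_s=round(mean(sents)),
        min_t=min(toks), max_t=max(toks), avg_t=round(mean(toks)),
    )

# Made-up ground-truth answers.
answers = [
    "Amanda baked cookies. She will bring some to Jerry tomorrow.",
    "Tom is late.",
    "Anna lost her keys. Ben found them. They will meet at noon.",
]
ocg = build_ocg(answers)
print(ocg)
```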

Automated Combination Selection (Step 5)

Key Findings: Different models possess different inherent knowledge for various tasks, indicating that a single configuration is not universally applicable. For example, in the SWiPE task, since the variance of ground-truth length is extremely high, OCG is actually counterproductive, while MG yields significant improvements; on the other hand, OCG is more effective than MG in translation tasks. Therefore, automated selection on the training set is necessary, requiring the evaluation of only 4 variants.
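The selection step can be sketched as picking the configuration with the highest average ROUGE-L on the training set. This is an illustration under stated assumptions: `rouge_l_f1` is a simplified LCS-based F1 on lowercase whitespace tokens rather than an official ROUGE implementation, and `fake_generate` with its canned outputs is a hypothetical stand-in for prompting the model under each configuration.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1 on lowercase whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def select_guideline(generate, train, configs):
    """Return the config whose outputs score highest on average over `train`."""
    def avg_score(cfg):
        return sum(rouge_l_f1(generate(x, cfg), y) for x, y in train) / len(train)
    return max(configs, key=avg_score)

# Hypothetical canned outputs standing in for real model generations.
CANNED = {
    "none": "amanda baked many cookies and she said that she will bring "
            "some of them to jerry tomorrow",
    "MG": "amanda will bring jerry cookies tomorrow",
    "OCG": "amanda bakes cookies",
    "MG+OCG": "amanda will bring cookies to jerry tomorrow",
}

def fake_generate(source: str, config: str) -> str:
    return CANNED[config]

train = [("<dialogue text>", "amanda will bring cookies to jerry")]
best = select_guideline(fake_generate, train, ["none", "MG", "OCG", "MG+OCG"])
print(best)
```

Because only four variants are compared, the search is a plain exhaustive `max` rather than an optimization loop.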

Key Experimental Results

Main Results: 7 Long-form Generation Tasks (ROUGE-L / GPT-4o-Judge)

Task | Model | Zero-shot | + LongGuide | Gain | Few-shot | + LongGuide | Gain
--- | --- | --- | --- | --- | --- | --- | ---
SAMSum | Mistral | 22.20 / 7.43 | 28.35 / 7.73 | +6.15 / +0.30 | 27.13 / 7.66 | 30.65 / 7.72 | +3.52 / +0.06
CNN/DM | Mistral | 19.23 / 7.38 | 22.46 / 7.45 | +3.23 / +0.07 | 17.56 / 5.84 | 19.19 / 5.99 | +1.63 / +0.15
XL-Sum | Mistral | 9.19 / 5.96 | 14.38 / 6.29 | +5.19 / +0.33 | 9.79 / 4.46 | 15.23 / 5.06 | +5.44 / +0.40
SWiPE | Mistral | 36.60 / 7.21 | 38.21 / 7.32 | +1.61 / +0.11 | 39.47 / 7.12 | 41.36 / 7.24 | +1.89 / +0.12
CommGen | Mistral | 10.12 / 5.14 | 25.20 / 6.81 | +15.08 / +1.67 | 3.98 / 1.34 | 25.05 / 6.65 | +21.07 / +5.31
SAMSum | ChatGPT | 23.83 / 7.43 | 30.47 / 7.59 | +6.64 / +0.16 | 22.21 / 7.32 | 31.46 / 7.72 | +9.25 / +0.40
CommGen | ChatGPT | 24.21 / 6.53 | 34.41 / 7.23 | +10.20 / +0.70 | 22.08 / 4.19 | 38.21 / 7.21 | +16.13 / +3.02

Each cell reports ROUGE-L / GPT-4o-Judge score.

Average Gain: Mistral +5.39% ROUGE-L, ChatGPT +6.58% ROUGE-L. LongGuide outperforms the APO prompt optimization baseline across all configurations.

ICL Property Transfer Experiments: Attribute Maintenance Rate under 5-shot Exemplars

Model | COV | FAC | CON | INF | COH | REL | NT (mean / std)
--- | --- | --- | --- | --- | --- | --- | ---
Expected | 100% | 100% | 100% | 100% | 100% | 100% | 17.00 / 0.00
Mistral-7B-it | 38% | 80% | 78% | 17% | 75% | 88% | 50.25 / 55.54
Llama-3.1-8B-it | 44% | 86% | 82% | 26% | 81% | 87% | 34.72 / 45.29
Qwen2.5-7B | 43% | 90% | 85% | 40% | 78% | 96% | 281.38 / 264.59

Key Findings: Although all exemplars contain exactly 17 output tokens, the model outputs average roughly 35–281 tokens, with extremely large standard deviations. The transfer rates of crucial metrics such as Semantic Coverage (COV) and Informativeness (INF) are only 17%–44%.
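One plausible way to read the percentages in this experiment is as the share of generated outputs whose judged score matches the exemplars' perfect score. The sketch below computes that share; the function name `maintenance_rate` and the judge scores are made up for illustration and are not taken from the paper.

```python
def maintenance_rate(output_scores: list[int], expected: int = 5) -> float:
    """Share of generated outputs whose judged score equals the exemplar score."""
    if not output_scores:
        raise ValueError("need at least one scored output")
    kept = sum(1 for s in output_scores if s == expected)
    return kept / len(output_scores)

# Made-up judge scores (1-5) for ten generations on a single attribute.
scores = [5, 4, 5, 3, 5, 4, 4, 5, 2, 4]
rate = maintenance_rate(scores)
print(f"{rate:.0%}")
```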

Ablation Study: MG vs. OCG vs. MG+OCG Optimal Selection Counts

Across 28 experimental groups (2 models \(\times\) 7 tasks \(\times\) 2 settings): MG+OCG wins 15 times, OCG wins 10 times, MG wins 2 times, and "no guideline" wins 1 time. OCG is particularly effective in summarization, translation, and table-to-text tasks, whereas MG holds a stronger advantage in the SWiPE text simplification task.

Highlights & Insights

  • Solid Theory: Rigorously proves by contradiction that ICL cannot recover the true task distribution when \(P_\mathcal{M} \neq P_T\), providing a theoretical foundation for the PT problem.
  • Ingenious Metric Self-evaluation Paradigm: Uses the LLM's self-evaluation capability (LLM-as-Judge) to identify the attribute dimensions a task needs optimized, then feeds those dimensions back as guidance for the same model's generation. In effect, the model teaches itself how to do the task.
  • Extremely Low Cost: Requires only \(\le 50\) training samples and validation of 4 prompt variants, putting its cost at roughly 1/3.75 (about 27%) of APO's.
  • Cross-Model Transferability: The MG learned by a weaker model (Mistral) can enhance the performance of a stronger model (ChatGPT), but the reverse does not hold. The explanation offered is that stronger models are better able to understand and exploit guidelines written by weaker ones.
  • Complementary to Prompt Optimization: The guidelines from LongGuide can be further optimized by algorithms like APO and adv-ICL, yielding even better synergistic effects.
  • Human Evaluation Validation: Annotators prefer the generation quality of LongGuide outputs 92% of the time, with OCG achieving a human evaluation win rate of up to 95%.

Limitations & Future Work

  • Task-Level Statistics Rather Than Sample-Level: MG and OCG are based on mean statistics from the training set, failing to provide tailor-made guidance for individual instances. This might prove ineffective for tasks with immense output length variance (e.g., Code2Text, StoryGeneration).
  • Dependency on Instruction-Following Capability: Directly applying this to non-instruct-tuned models yields limited performance; such models require guidelines learned from instruction-tuned counterparts.
  • Manually Constructed Metric Pool: While the 27 metrics offer extensive coverage, they might miss crucial attributes specific to specialized domains.
  • Coarse-Grained OCG Constraints: OCG only restricts sentence and token counts, leaving more fine-grained control dimensions—such as paragraph structures, keywords, and tone—unaddressed.
  • Limited Effectiveness on Pre-Trained Tasks: For tasks already covered in the model's pre-training data (e.g., WebNLG, E2E NLG), guidelines may introduce out-of-distribution (OOD) contexts and potentially degrade performance.

Rating

⭐⭐⭐⭐ (4/5)

  • Novelty ⭐⭐⭐⭐: For the first time, it theoretically uncovers the fundamental limitations of ICL in long-form generation, clearly defining the PT problem. The proposed MG+OCG dual-stream guideline framework is logical and practical.
  • Experimental Thoroughness ⭐⭐⭐⭐⭐: Experiments span 7 generation tasks and 1 real-world dialogue benchmark (covering summarization, simplification, translation, dialogue, and table-to-text) on both open-source and closed-source models, with comprehensive ablation studies and human evaluations.
  • Value ⭐⭐⭐⭐: The approach is simple, easy to use, cheap (only requiring API calls), and works synergistically with existing prompt optimization techniques, showing strong potential for practical engineering deployment.
  • Writing Quality ⭐⭐⭐⭐: The paper is well-structured. Although the theoretical proof is rigorous, the practical implications of its assumptions warrant further consideration.