An Empirical Study of Many-to-Many Summarization with Large Language Models

Conference: ACL 2025
arXiv: 2505.12983
Code: None
Area: Multilingual NLP / Text Summarization
Keywords: many-to-many summarization, multilingual, LLM, instruction tuning, factual consistency

TL;DR

This work presents the first systematic study of Large Language Model (LLM) performance on the Many-to-Many Summarization (M2MS) task. By integrating 8 datasets, the authors construct a benchmark containing 47.8K samples across 5 domains and 6 languages. Evaluating 18 LLMs reveals that zero-shot LLMs perform comparably to fine-tuned traditional models, and significantly outperform them after instruction tuning. However, factual consistency remains a critical bottleneck.

Background & Motivation

Background: Many-to-Many Summarization (M2MS) requires models to summarize source documents in any language into summaries in any target language, combining both cross-lingual translation and text summarization capabilities.

Limitations of Prior Work: Existing M2MS studies predominantly use traditional pre-trained models (e.g., mBART) and do not systematically explore LLM capabilities. Moreover, existing datasets are each limited to a single domain, which hinders comprehensive evaluation.

Key Challenge: LLMs are naturally endowed with multilingual capabilities and theoretically should serve as excellent M2MS solvers, but this has not been fully verified in practice.

Goal: To systematically evaluate the zero-shot and instruction-tuning performance of LLMs in multi-domain, multilingual M2MS scenarios.

Key Insight: Integrate multi-source datasets to construct a unified benchmark, covering three paradigms: zero-shot, instruction tuning, and traditional models.

Core Idea: Use 47.8K multi-domain, multilingual samples to expose both the strengths of LLMs in M2MS (instruction-tuned open-source models surpass zero-shot GPT-4 on automatic metrics) and their weaknesses (worsened factual consistency).

Method

Overall Architecture

The study proceeds in four steps, with the focus on a comprehensive comparison across experimental paradigms:
1) M2MS samples are integrated from 8 datasets, covering 5 domains (News, Encyclopedia, Dialogue, Guide, Technology) and 6 languages (En, Cs, De, Fr, Zh, Uk).
2) 18 LLMs are evaluated in a zero-shot setting, using carefully designed prompts that contain task instructions and in-context examples.
3) Open-source LLMs undergo instruction tuning on 19.5K training samples.
4) A fine-grained human evaluation of factuality is conducted.
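
The "any language to any language" setting is easiest to see by enumerating the directions. The following is a minimal sketch, assuming that the 30 language pairs reported for the benchmark are the ordered cross-lingual pairs over the six languages (an interpretation, not something this summary confirms); the ISO-style language codes are also an assumption.

```python
from itertools import permutations, product

# The six benchmark languages listed above, written as lowercase codes (assumed).
LANGUAGES = ["en", "cs", "de", "fr", "zh", "uk"]

# Ordered (source, target) pairs with distinct languages: cross-lingual directions.
cross_lingual = list(permutations(LANGUAGES, 2))

# All ordered pairs, including monolingual ones such as ("en", "en").
all_directions = list(product(LANGUAGES, repeat=2))

print(len(cross_lingual))   # 30
print(len(all_directions))  # 36
```

Under this reading, a single M2MS model has to handle every one of these directions, whereas a cross-lingual summarization model is typically trained for one direction at a time.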

Key Designs

Data Integration: Samples are selected from 8 datasets such as CrossSum, XWikis, and WikiLingua, covering 5 domains (News, Encyclopedia, Dialogue, Guide, Technology) and 6 languages (English, Czech, German, French, Chinese, Ukrainian). Low-quality samples are filtered out based on three intrinsic metrics: coverage, redundancy, and coherence.
Data Contamination Control: Target-instance-level contamination is measured on the test set, and the ratio of contaminated samples is kept below 1% so that the evaluation remains fair.
Multi-Paradigm Evaluation: A three-track parallel comparison is conducted: zero-shot (with meticulously designed prompts + in-context examples), instruction tuning (19.5K training samples), and fine-tuned traditional models (mBART-50/PISCES).
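
To make the zero-shot track more concrete, here is a hypothetical sketch of how a prompt with a task instruction and in-context examples could be assembled. The instruction wording, the `Example` type, and the `build_prompt` helper are all illustrative assumptions; the prompts actually used in the paper are not reproduced in this summary.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Example:
    """One in-context demonstration (hypothetical structure)."""
    document: str
    summary: str


# Display names for the six benchmark languages.
LANG_NAMES = {
    "en": "English", "cs": "Czech", "de": "German",
    "fr": "French", "zh": "Chinese", "uk": "Ukrainian",
}


def build_prompt(document: str, target_lang: str,
                 examples: Sequence[Example] = ()) -> str:
    """Assemble instruction + demonstrations + the document to summarize."""
    target = LANG_NAMES[target_lang]
    # Illustrative instruction wording, not the paper's actual prompt.
    parts = [f"Summarize the following document in {target}."]
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i}\nDocument: {ex.document}\nSummary: {ex.summary}")
    parts.append(f"Document: {document}\nSummary:")
    return "\n\n".join(parts)


demo = [Example("Ein kurzer Beispieltext.", "一段简短的示例文本。")]
print(build_prompt("Der zu verdichtende Artikel.", "zh", demo))
```

The same instruction and response layout could in principle be reused as the format for instruction-tuning records, with the reference summary filling the response slot.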

Loss & Training

Traditional models are trained using standard seq2seq objectives.
LLM instruction tuning utilizes the instruction-response format of the training set.
Evaluation metrics: ROUGE-1/2/L, BERTScore, and GPT-4o scoring (on a 5-point scale for conciseness, coherence, and relevance).
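
For readers unfamiliar with the automatic metrics, the sketch below computes ROUGE-1 and ROUGE-L as F1 scores from scratch. It assumes whitespace tokenization, which is a simplification: it would not segment Chinese text properly, and the tokenizer and ROUGE implementation used in the paper are not specified in this summary.

```python
from collections import Counter
from typing import List


def _f1(overlap: int, n_cand: int, n_ref: int) -> float:
    """Harmonic mean of precision (overlap/n_cand) and recall (overlap/n_ref)."""
    if overlap == 0 or n_cand == 0 or n_ref == 0:
        return 0.0
    precision, recall = overlap / n_cand, overlap / n_ref
    return 2 * precision * recall / (precision + recall)


def rouge_1(candidate: str, reference: str) -> float:
    """Unigram overlap F1, with clipped counts."""
    c, r = candidate.split(), reference.split()
    overlap = sum((Counter(c) & Counter(r)).values())
    return _f1(overlap, len(c), len(r))


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """Longest-common-subsequence F1."""
    c, r = candidate.split(), reference.split()
    return _f1(lcs_length(c, r), len(c), len(r))


cand = "the cat sat on the mat"
ref = "the cat is on the mat"
print(round(rouge_1(cand, ref), 4))  # 0.8333
print(round(rouge_l(cand, ref), 4))  # 0.8333
```

In the toy example, five of six tokens overlap and the longest common subsequence ("the cat on the mat") also has length five, so both scores equal 5/6.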

Key Experimental Results

Main Results (Zero-shot LLMs vs. Fine-tuned Traditional Models, Overall R1/RL/BS)

Model	R1	RL	BS
GPT-4o (zero-shot)	26.0	16.6	66.7
GPT-4 (zero-shot)	25.7	16.4	66.4
GPT-3.5-turbo	25.2	16.1	66.7
Vicuna-13B-16k	22.9	13.9	66.0
Qwen2.5-14B	22.1	13.1	65.4
LLaMa-2-7B	18.2	10.8	63.3

Ablation Study (Cross-domain Performance, R1 Metric)

Model	News	Encyc.	Dialogue	Guide	Tech.
GPT-4o	19.8	27.9	29.5	25.1	34.2
GPT-4	19.5	26.9	28.9	24.0	33.8
Vicuna-13B-16k	19.0	27.2	22.6	20.3	33.0
Qwen2.5-14B	18.4	25.8	22.0	18.5	32.6
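
The domain ranking can be read directly off the table above. The short check below copies the R1 values from the table and finds the best and worst domain for each model; the `extremes` helper is just an illustrative name.

```python
DOMAINS = ["News", "Encyc.", "Dialogue", "Guide", "Tech."]

# R1 scores copied from the cross-domain table above.
R1 = {
    "GPT-4o": [19.8, 27.9, 29.5, 25.1, 34.2],
    "GPT-4": [19.5, 26.9, 28.9, 24.0, 33.8],
    "Vicuna-13B-16k": [19.0, 27.2, 22.6, 20.3, 33.0],
    "Qwen2.5-14B": [18.4, 25.8, 22.0, 18.5, 32.6],
}


def extremes(scores):
    """Return (best_domain, worst_domain) for one row of scores."""
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return DOMAINS[best], DOMAINS[worst]


for model, scores in R1.items():
    best, worst = extremes(scores)
    print(f"{model}: best={best}, worst={worst}")
```

For all four models listed, Tech. is the highest-scoring domain and News the lowest, although for Qwen2.5-14B the gap between News (18.4) and Guide (18.5) is only 0.1 points.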

Key Findings

Data scale: 19,530 train / 14,150 val / 14,150 test samples, covering 30 language pairs.
Flores translation capability ranking: GPT-4o (29.1) > GPT-4 (27.7) > GPT-3.5 (22.0) > Qwen2.5-14B (19.2).
Models supporting longer contexts (e.g., Vicuna-13B-16k) exhibit a slight advantage in M2MS.
Zero-shot LLMs can already compete with fine-tuned traditional models (mBART-50/PISCES), with GPT-4o achieving the overall best performance.
After instruction tuning, open-source LLMs (e.g., Qwen2.5-14B) can outperform zero-shot GPT-4 on automatic metrics.
Instruction tuning does not sacrifice general task capabilities (MMLU scores remain stable).
Factual consistency is a key bottleneck: Human evaluation shows that open-source LLMs produce more factual errors than GPT-4, and instruction tuning might exacerbate hallucinations.
The technology domain (Tech) achieves the highest scores, while the news domain (News) is the most challenging.
Multilingual translation capability (Flores score) is positively correlated with M2MS performance.
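
As a rough illustration of the last finding, the sketch below computes a Pearson correlation between the Flores scores and the overall R1 scores quoted in this summary. It covers only the four models for which both numbers appear here, and it assumes that "GPT-3.5" in the Flores ranking is the same model as GPT-3.5-turbo in the main results table. Four points are far too few to draw conclusions, so this shows the calculation rather than reproducing the paper's analysis.

```python
from math import sqrt

# Order: GPT-4o, GPT-4, GPT-3.5(-turbo), Qwen2.5-14B
flores = [29.1, 27.7, 22.0, 19.2]   # Flores translation scores quoted above
overall_r1 = [26.0, 25.7, 25.2, 22.1]  # overall R1 from the main results table


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


print(round(pearson(flores, overall_r1), 2))  # 0.86
```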

Highlights & Insights

This is the first large-scale systematic evaluation of LLMs' M2MS capabilities, covering 18 open-source and closed-source LLMs from multiple model families, including English- and Chinese-centric ones.
Reveals the "double-edged sword" effect of instruction tuning: it boosts automatic metrics but may exacerbate hallucinations, which is a crucial warning for alignment research.
Data contamination control is crucial for fair evaluation and represents an essential step in benchmark design in the LLM era.
Multi-domain analysis (across 5 domains) provides fine-grained insights: the technology domain performs the best, while the news domain is the most difficult.
The positive correlation between multilingual translation capability (Flores score) and M2MS performance is intuitive, but this is the first time it has been systematically validated.
The data integration of 47.8K samples covering 6 languages and 30 language pairs forms a highly valuable benchmark resource.
Validated that instruction tuning does not compromise general capabilities (no drop in MMLU), mitigating concerns about catastrophic forgetting during fine-tuning.

Limitations & Future Work

The hallucinations exacerbated by instruction tuning may stem from factual errors in the reference summaries of the training data, which calls for cleaner training data.
The dataset covers only 6 languages (En, Cs, De, Fr, Zh, Uk), leaving low-resource languages and most non-Latin scripts (beyond Chinese and Cyrillic) unaddressed.
Lacks evaluation of larger LLMs (such as 70B+ models, Mixtral, and other MoE architectures).
Has not explored M2MS capabilities for long documents (>16K tokens).
Factuality control methods for open-source LLMs are not thoroughly investigated; only a diagnostic of the problem is provided.
There may still be a noticeable gap between automatic metrics (ROUGE/BERTScore) and human evaluation.

CrossSum (Bhattacharjee et al., 2023) and PISCES (Wang et al., 2023c) are foundational works in M2MS, showing that M2MS outperforms independent cross-lingual summarization (CLS).
The hallucination issue is closely aligned with general LLM hallucination research (Zhang et al., 2023).
The degradation of factuality in instruction tuning deserves attention in alignment research, potentially requiring a factuality reward.
mBART-50 (Tang et al., 2021), as a traditional baseline, remains competitive, demonstrating the advantages of the encoder-decoder architecture in summarization tasks.

Rating

Novelty: ⭐⭐⭐ The task definition and method itself are not entirely new, but the first systematic LLM evaluation is valuable.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Highly comprehensive, covering 18 LLMs, 5 domains, 6 languages, and multiple evaluation approaches.
Writing Quality: ⭐⭐⭐⭐ Rigorous experimental design and in-depth analysis; the explanation of data contamination control reflects the authors' meticulousness.
Value: ⭐⭐⭐⭐ The revealed factual consistency issues provide crucial warnings for practical applications.
Overall Rating: A classic empirical study paradigm. The benchmark contribution is significant, and the findings hold practical value for deploying open-source LLMs.
Reproducibility: The data integration pipeline is clear and can be extended to more languages and domains.
Scalability: Future work can explore factuality-aware instruction tuning to alleviate hallucination issues.
Open Question: How can we design M2MS training data that does not introduce hallucination signals?
Impact: Provides a standardized benchmark and methodology for evaluation of LLMs in the field of multilingual summarization.