CNNSum: Exploring Long-Context Summarization with Large Language Models in Chinese Novels¶

Conference: ACL 2025
arXiv: 2412.02819
Code: GitHub
Area: LLM Efficiency
Keywords: long-context summarization, Chinese novels, benchmark, LLM evaluation, RoPE extrapolation

TL;DR¶

The authors construct CNNSum, a multi-scale long-text summarization benchmark built from Chinese novels (695 samples spanning 16k to 128k tokens), with quality ensured through human annotation. A systematic evaluation of 20+ LLMs yields four main findings: (1) advanced LLMs tend to generate subjective commentary, which makes their summaries vague; (2) smaller models are more cost-effective; (3) fine-tuning Base versions gives better results than fine-tuning Chat versions; and (4) fine-tuning on short-context data alone can significantly improve long-text summarization.

Background & Motivation¶

Background: Despite the rapid development of long-context LLMs (where 128k contexts are now common), research on long-text summarization has progressed slowly, and existing long-text summarization datasets remain severely insufficient.

Limitations of Prior Work:
- Existing benchmarks are mostly based on older datasets (such as BookSum, CNNDM, etc.), posing high data leakage risks.
- The sample sizes are small (dozens of samples), and the average/maximum lengths are short (typically <16k).
- A lack of multi-scale length subsets makes it impossible to evaluate performance across different context lengths.
- Annotation quality is poor: summaries are either collected from the web (high leakage risk) or synthesized by LLMs (introducing various errors).

Key Challenge: While 128k context windows have become standard, the performance of LLMs in long-text summarization degrades sharply as length increases. Outputs may become chaotic, meaningless, or fail to follow instructions. The core bottleneck lies in the lack of high-quality datasets and systematic research guidance.

Goal: To construct a high-quality multi-scale Chinese long-text summarization benchmark, and to systematically explore the capability boundaries and optimization strategies of LLMs in long-text summarization.

Key Insight: Chinese web novels are highly original and carry a low leakage risk, so the authors sample from them at four distinct scales (L, XL, 2XL, 3XL) and annotate the samples through a human-LLM collaborative process.

Core Idea: Robust research in long-text summarization requires a rigorous benchmark. CNNSum fills the gap in Chinese long-text summarization benchmarks through strict multi-scale design and human annotation.

Method¶

Overall Architecture¶

CNNSum construction pipeline: Corpus Collection → Multi-Scale Sampling → Summary Annotation → Benchmark Evaluation & Exploration

Key Designs¶

1. Corpus Collection and Filtering

Collected 103 Chinese web novels, each characterized by a clear chapter structure.
Excluded books comprising multiple independent short stories or lacking a main plotline.
Used Qwen2-72B-Instruct to detect potentially popular books (which carry a high leakage risk), and filtered out 27 novels.
Applied regular expressions and manual correction to fix non-standard punctuation and remove irrelevant inserted content.
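
The regex-based cleaning step can be illustrated with a minimal sketch. The source does not list the actual rules, so the patterns below (`AD_PATTERN`, the punctuation mapping, and the function name `clean_text`) are hypothetical examples of the kind of normalization described, not the authors' pipeline.

```python
import re

# Hypothetical mapping from ASCII punctuation to full-width Chinese punctuation.
ASCII_TO_FULLWIDTH = {",": "，", "?": "？", "!": "！", ":": "：", ";": "；"}

# Hypothetical pattern for inserted site notices; the real rules are not given in the source.
AD_PATTERN = re.compile(r"[（(]本章未完[^）)]*[）)]")


def clean_text(text: str) -> str:
    """Remove inserted notices and normalize non-standard punctuation."""
    # 1. Drop irrelevant inserted content such as "(本章未完...)".
    text = AD_PATTERN.sub("", text)
    # 2. Convert ASCII punctuation to full-width, but only right after a CJK character.
    text = re.sub(
        r"(?<=[\u4e00-\u9fff])([,?!:;])",
        lambda m: ASCII_TO_FULLWIDTH[m.group(1)],
        text,
    )
    # 3. Collapse runs of three or more dots or Chinese full stops into a standard ellipsis.
    text = re.sub(r"\.{3,}|。{3,}", "……", text)
    return text


print(clean_text("他说,你好!(本章未完,请翻页)..."))
```

Rules like these only catch regular patterns, which is presumably why the authors pair them with manual correction.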

2. Multi-Scale Sampling Strategy

Based on the Yi tokenizer, four target lengths and corresponding ranges are defined:

Subset	Target Length	Sampling Range	Samples	Source Books
L	16k	[12k, 18k]	190	76
XL	32k	[26k, 34k]	195	71
2XL	64k	[54k, 66k]	200	60
3XL	128k	[112k, 130k]	110	45

A sliding window approach is employed to sample along chapters, prioritizing samples from rare books to preserve diversity.
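
A minimal sketch of chapter-level sliding-window sampling is given below. It assumes that per-chapter token counts are already available (for example from the Yi tokenizer), that "k" means 1,000 tokens, and that accepted windows do not overlap. None of these details are confirmed by the source, and the rare-book prioritization is not modeled. The names `RANGES` and `chapter_windows` are hypothetical.

```python
from typing import List, Tuple

# Sampling ranges from the table above, in tokens (assuming k = 1,000).
RANGES = {
    "L": (12_000, 18_000),
    "XL": (26_000, 34_000),
    "2XL": (54_000, 66_000),
    "3XL": (112_000, 130_000),
}


def chapter_windows(chapter_tokens: List[int], lo: int, hi: int) -> List[Tuple[int, int]]:
    """Return non-overlapping (start, end) chapter spans, end exclusive,
    whose total token count lies in [lo, hi]."""
    spans = []
    start, n = 0, len(chapter_tokens)
    while start < n:
        total, end = 0, start
        # Grow the window one chapter at a time until it reaches the lower bound.
        while end < n and total < lo:
            total += chapter_tokens[end]
            end += 1
        if lo <= total <= hi:
            spans.append((start, end))
            start = end  # slide past the accepted span
        else:
            start += 1  # overshot the upper bound or ran out of chapters
    return spans


lo, hi = RANGES["L"]
print(chapter_windows([5_000] * 10, lo, hi))
```

Sampling whole chapters keeps every sample aligned with the book's chapter boundaries, which is why the ranges are intervals around each target length rather than exact values.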

3. Summary Annotation

LLMs are first used to generate plot summaries for each chapter.
23 human annotators review these summaries, select the key plots, and rewrite them.
2XL/3XL samples are annotated by one person and cross-checked by another.
Requirements: (1) Rewrite in the annotator's own words rather than simply deleting or merging sentences; (2) Avoid subjective comments and focus strictly on objective plot events.
Word limits: 500 characters for L/XL, 600 characters for 2XL/3XL.

4. Two Prompt Types

Prompt-IB: Instructions placed at the beginning of the text.
Prompt-IE: Instructions placed at the end of the text.
Prompt type can significantly affect output quality: for some models, the scores obtained with the two prompts differ substantially, as measured by the MSE between them.
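
The difference between the two prompt types comes down to where the instruction sits relative to the novel text. The sketch below shows this; the instruction wording and the `build_prompt` helper are hypothetical, since the source does not give the exact templates.

```python
# Hypothetical instruction wording; the actual prompt text is not given in the source.
INSTRUCTION = "Summarize the main plot of the following novel excerpt."


def build_prompt(text: str, kind: str, instruction: str = INSTRUCTION) -> str:
    """Build a Prompt-IB (instruction at the beginning) or Prompt-IE (instruction at the end)."""
    if kind == "IB":
        return f"{instruction}\n\n{text}"
    if kind == "IE":
        return f"{text}\n\n{instruction}"
    raise ValueError(f"unknown prompt type: {kind!r}")


excerpt = "<novel text goes here>"
print(build_prompt(excerpt, "IB"))
print(build_prompt(excerpt, "IE"))
```

With inputs of up to 128k tokens, the instruction in Prompt-IB is very far from the position where generation starts, which is one plausible reason some models respond differently to the two layouts.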

Evaluation Metrics¶

The evaluation primarily uses ROUGE-L, accompanied by human-conducted fine-grained analysis of anomalous output types.
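
For reference, ROUGE-L is based on the longest common subsequence (LCS) between a candidate summary and the reference. The sketch below computes a character-level ROUGE-L F1 score, which is one common choice for Chinese text. Whether the paper tokenizes by character or by word, and which F-measure weighting it uses, are assumptions here.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Character-level ROUGE-L F1 in [0, 1] (assumes equal weight on precision and recall)."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("主角离开了村庄", "主角在夜里离开村庄"))
```

Because the score only rewards overlapping subsequences, a summary that replaces plot events with evaluative commentary shares little with an objective reference and scores low, even if it reads fluently.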

Key Experimental Results¶

Main Results: ROUGE-L Scores¶

Model	L (16k)	XL (32k)	2XL (64k)	3XL (128k)
GPT-4o	15.5	14.2	12.5	-
Gemini-1.5-pro	19.3	18.1	16.8	14.6
Qwen-plus	20.5	18.5	16.4	14.8
Moonshot-v1-128k	22.4	20.3	18.0	15.2
Qwen2.5-72B-Inst	19.6	17.6	13.6	13.4
InternLM2.5-7B-Chat-1M	18.0	17.1	14.7	13.0
Yi-1.5-34B-32K	11.6	10.5	9.6	0.1
Yi-6B-200K	9.9	9.4	8.8	4.0
Llama3.1-8B-Inst	15.6	14.3	12.8	9.9
LWM-Text-1M	3.3	3.0	2.5	1.1

Impact of Prompt Types (MSE between P-IB and P-IE)¶

Model	MSE
Yi-6B-200K	4.5
Yi-1.5-34B-32K	14.5
Yi-1.5-34B-Chat-16K	7.8
Qwen1.5-7B	34.4
Qwen2-72B	16.3
GPT-4o	0.0
Gemini-1.5-pro	0.1

An MSE \(\ge 5.0\) indicates that the prompt type strongly affects the model's scores, whereas proprietary models typically exhibit greater stability.
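
One plausible reading of this metric is the mean squared difference between a model's scores under Prompt-IB and Prompt-IE, averaged over the evaluated subsets. The sketch below follows that reading. The exact definition used in the paper is an assumption, and the scores in the example are made-up placeholders, not results from the paper.

```python
from typing import Sequence


def prompt_mse(scores_ib: Sequence[float], scores_ie: Sequence[float]) -> float:
    """Mean squared difference between per-subset scores under the two prompt types."""
    if len(scores_ib) != len(scores_ie) or not scores_ib:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(scores_ib, scores_ie)) / len(scores_ib)


# Made-up placeholder scores for the L, XL, 2XL, and 3XL subsets (not from the paper).
ib = [15.0, 14.0, 12.0, 10.0]
ie = [18.0, 16.0, 15.0, 10.0]
print(prompt_mse(ib, ie))
```

Squaring the differences means a single subset with a large gap dominates the value, so a high MSE can reflect a collapse at one length rather than a uniform shift.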

Key Findings¶

GPT-4o favors subjective commentary, resulting in vague summaries. Consequently, its ROUGE-L scores are lower than those of models such as Moonshot and Qwen.
Larger models are not necessarily better: Their advantages in reasoning and comprehension are difficult to leverage in long-text summarization tasks, making smaller models more cost-effective.
Chat/Instruct versions may compromise the summarization capabilities of Base models: Experimental fine-tuning demonstrates that Base models achieve superior results.
Models scaled with RoPE ABF exhibit strong extrapolation potential: Fine-tuning with short-text data alone can significantly improve long-text summarization performance.
Mixed-length samples may lead to misleading evaluation outcomes: Multi-scale isolated evaluation provides more reliable results.
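
As background for the RoPE finding: ABF (adjusted base frequency) scales RoPE by raising the base used to compute the rotary frequencies. The sketch below shows the effect on the rotation wavelength of each dimension pair. The head dimension and base values are illustrative assumptions, not the settings of any model evaluated in the paper.

```python
import math
from typing import List


def rope_wavelengths(head_dim: int, base: float) -> List[float]:
    """Wavelength in tokens of each rotary pair: 2 * pi * base ** (2i / head_dim)."""
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


# Illustrative values only: the original RoPE base versus an enlarged (ABF) base.
default = rope_wavelengths(128, 10_000.0)
adjusted = rope_wavelengths(128, 1_000_000.0)

print(f"slowest pair, base 1e4: {default[-1]:,.0f} tokens per full rotation")
print(f"slowest pair, base 1e6: {adjusted[-1]:,.0f} tokens per full rotation")
```

A larger base slows the rotation of the low-frequency pairs, so positions far beyond the fine-tuning length still map to rotation angles the model has already seen. This is a common intuition for why short-context fine-tuning can transfer to longer inputs; the source does not state it as the authors' own explanation.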

Highlights & Insights¶

"Long-text summarization primarily relies on memory capacity": this observation offers an explanation for why the reasoning advantages of larger LLMs fail to manifest in this task.
Reusable dataset construction methodology: The workflow of multi-scale sampling, sliding windows, and human annotation transfers readily to other benchmarks.
Quantification of prompt type impacts: The MSE metric visually illustrates how prompt placement differently affects various models, offering practical guidance for real-world applications.
Thorough leakage risk control: A multi-pronged approach combining newly published books, LLM-based detection, and rigorous filtering.
Systematic and comprehensive fine-tuning exploration: Validation from multiple angles including Base vs. Chat comparison, short-text training for long-text generalization, and RoPE extrapolation.

Limitations & Future Work¶

Restricted to the Chinese novel domain: The summarization style and structure may not generalize well to other domains like academic papers or news.
Limitations of ROUGE-L evaluation: The paper acknowledges the significant gap between ROUGE-L and human preferences, yet does not propose a superior alternative.
The 3XL subset contains only 110 instances: The statistical reliability of evaluations at the 128k length remains somewhat limited.
Annotation quality depends heavily on annotator comprehension: Long-text annotation is inherently demanding for human evaluators, making individual differences difficult to eliminate completely.
Lack of evaluation on the latest models, such as o1 or Claude.

Related Work & Insights¶

Compared to BookSum: CNNSum is more recent, multi-scale, Chinese, and human-annotated, which gives it a lower leakage risk and better length coverage than the older benchmark.
Compared to CLongEval-LStSum: The latter uses GPT-4 to synthesize labels and mixes sample lengths, whereas CNNSum's human-annotated, length-separated subsets make extrapolation evaluations more reliable.
Insights for fine-tuning strategies: Training on short texts to generalize to long contexts is particularly effective for models scaled with RoPE ABF, offering an affordable path to boost long-context capabilities.
The discovery regarding "subjective commentary vs. objective plot" provides actionable guidance for prompt design, suggesting that LLMs should be explicitly instructed to avoid evaluative language.

Rating¶

Novelty: ⭐⭐⭐ — The core contribution lies in the benchmark construction; the methodology itself is not highly novel.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation over 20+ models, coverage of both proprietary and open-source models, and a systematic exploration of fine-tuning and extrapolation.
Writing Quality: ⭐⭐⭐⭐ — Highly clear summaries of findings, rich tables, and a well-structured narrative.
Value: ⭐⭐⭐⭐ — Fills the gap in Chinese long-text summarization benchmarks; the insights offer practical guidance for real-world applications.