POEMetric: The Last Stanza of Humanity

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=9VkJ058cTa
Code: https://github.com/Bingru-Li/POEMetric
Area: LLM Evaluation / Creative Writing Benchmark
Keywords: Poetry Generation, LLM Evaluation, Creativity, LLM-as-a-judge, Human Expert Verification

TL;DR

This paper proposes POEMetric, presented as the first framework for systematically evaluating poetry generation. It combines 10 dimensions (basic instruction following, advanced creative abilities, and global evaluation), a dataset of 203 human fixed-form poems, and 6,090 poems generated by 30 LLMs, and cross-verifies scores using rule-based algorithms, LLM judges, and human experts. The results show that while top-tier LLMs score highly on form and near-perfectly on theme, they remain well behind human poets in creativity, idiosyncrasy, emotional resonance, imagery, and rhetoric, the core elements that define poetry.

Background & Motivation

Background: LLMs have shown remarkable performance in logical tasks such as mathematics, coding, and reasoning. However, the arts and humanities, particularly literary writing, remain less explored. The authors view poetry—characterized by the simultaneous compression of linguistic precision, emotional resonance, and cultural literacy within constrained forms—as the ideal prism to examine LLM generative capabilities. Its strict meter and concise length allow for quantifiable diagnostics.

Limitations of Prior Work: Existing poetry generation efforts (e.g., ByGPT5, PoeLM, GPoet) primarily focus on "formal accuracy of meter and rhyme," proving LLMs can write structurally compliant verses. Existing evaluations either rely on objective formal metrics like meter/rhyme accuracy, BLEU, and perplexity, or general text generation dimensions such as fluency and coherence. The true soul of poetry—creativity, authorial intent, emotion, imagery, and rhetorical beauty—has lacked systematic evaluation.

Key Challenge: The essence of poetic quality lies in "advanced, subjective abilities requiring literary criticism training," which are the most difficult to quantify automatically. Conversely, total reliance on human experts for poem-by-poem annotation is unscalable due to the scarcity of experts and the time-intensive nature of the task. Formal compliance \(\neq\) good poetry, but existing metrics only measure form.

Goal: To construct an evaluation framework that covers advanced creative dimensions, is capable of processing over 6,000 poems, and produces credible results, thereby answering "how far SOTA LLMs remain from human poets."

Key Insight: Distill elements that critics focus on in traditional literary criticism (specifically "Practical Criticism") into a set of scorable dimensions. Use a "triangulated verification" protocol involving objective rule-based indicators, LLM judges, and small-sample human expert validation to balance scale and credibility.

Core Idea: Transform the subjective matter of "poetic quality" into a reproducible benchmark using a literary theory-grounded 10-dimensional metric and a three-way verification protocol, placing LLMs and human poets on the same scale for comparison.

Method

Overall Architecture

POEMetric is not a model but a poetry evaluation benchmark consisting of a "dataset + metric system + evaluation protocol." The workflow begins with constructing a finely annotated human poetry dataset (203 poems across 7 fixed forms) as the gold standard and prompt source. Each human poem's "form + meter + rhyme + theme" is fed into a prompt template for 30 LLMs to generate 6,090 machine-written poems. These are assessed across 10 dimensions (categorized into basic instruction following, advanced creative ability, and global evaluation). Finally, three distinct methods—handcrafted rule-based algorithms, Gemini-2.5-Pro as an LLM judge, and 7 human experts—score the poems and cross-validate the results to ensure automation credibility. The input is "fixed-form + theme," and the output is a "Human vs. LLM" score comparison across the 10 dimensions.

The pipeline contains no trainable parameters or feedback loops; it is a linear evaluation pipeline of "Data \(\rightarrow\) Multi-model Batch Generation \(\rightarrow\) Multi-judge Multi-dimension Scoring \(\rightarrow\) Consistency Verification." The core value lies in the dataset labeling, the definition of the 10 dimensions, and the triangulated validation.
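The generation stage of this pipeline can be sketched as follows. This is a minimal illustration, not the authors' code: the `PoemSpec` class, the `PROMPT_TEMPLATE` wording, and the placeholder model names are all assumptions made for the example; only the four prompt fields and the 203 × 30 job count come from the description above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PoemSpec:
    """Constraints extracted from one human poem (hypothetical container)."""
    form: str
    meter: str
    rhyme_scheme: str
    theme: str


# Hypothetical wording; the paper's actual prompt template is not reproduced here.
PROMPT_TEMPLATE = (
    "Write a {form} in {meter} with the rhyme scheme {rhyme_scheme} "
    "on the theme of {theme}."
)


def build_prompt(spec: PoemSpec) -> str:
    """Fill the template with the constraints of one human poem."""
    return PROMPT_TEMPLATE.format(
        form=spec.form,
        meter=spec.meter,
        rhyme_scheme=spec.rhyme_scheme,
        theme=spec.theme,
    )


def generation_jobs(specs, model_names):
    """Yield one (model, prompt) job per model per human poem."""
    for spec, model in product(specs, model_names):
        yield model, build_prompt(spec)


# Placeholder data: 203 identical specs and 30 made-up model names.
specs = [PoemSpec("sonnet", "iambic pentameter", "ABAB CDCD EFEF GG", "loss")] * 203
models = [f"model-{i}" for i in range(30)]
jobs = list(generation_jobs(specs, models))
print(len(jobs))  # 203 * 30 = 6090
```

Because every model receives exactly the same constraints as the human poet, each of the 6,090 generated poems has a human counterpart written under the same form, meter, rhyme scheme, and theme.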

Key Designs

1. Human Fixed-form Gold Standard Dataset: Anchoring Subjective Evaluation with Quantifiable Constraints

The authors deliberately focus on fixed-form poetry rather than free verse. The reasoning is straightforward: fixed forms have explicit metrical and rhyming constraints, providing a quantifiable objective baseline before verifying more subjective advanced metrics. They collected 1,309 poems from the Poetry Foundation and Academy of American Poets, using custom algorithms to detect meter/rhyme, retaining 203 poems that strictly follow rhythmic patterns: 95 ballads, 71 sonnets, 12 villanelles, 9 ghazals, 7 sestinas, 6 limericks, and 3 pantoums. These cover a span from the 1800s to the present. Each entry is annotated with author, title, source, form, meter, rhyme scheme, theme, and imagery. This dataset serves as both the gold standard and the prompt source, ensuring a fair human-machine comparison.
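One way to picture a dataset entry is as a small record type. The sketch below is an assumption about layout (the `HumanPoem` class and its field names are hypothetical); the annotation fields and the per-form counts are taken from the paragraph above.

```python
from dataclasses import dataclass, field


@dataclass
class HumanPoem:
    """One annotated entry of the human gold-standard set (hypothetical schema)."""
    author: str
    title: str
    source: str
    form: str
    meter: str
    rhyme_scheme: str
    theme: str
    imagery: list[str] = field(default_factory=list)


# Per-form counts as reported in the summary above.
FORM_COUNTS = {
    "ballad": 95,
    "sonnet": 71,
    "villanelle": 12,
    "ghazal": 9,
    "sestina": 7,
    "limerick": 6,
    "pantoum": 3,
}

total_retained = sum(FORM_COUNTS.values())
retention_rate = total_retained / 1309  # share of the 1,309 collected poems that were kept

print(total_retained)            # 203
print(f"{retention_rate:.1%}")   # 15.5%
```

The counts sum to the reported 203 poems, which means roughly one in six collected poems passed the strict meter and rhyme filter. Ballads and sonnets together account for 166 of the 203 entries, so results for the rarer forms rest on very few poems.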

2. The Ten POEMetric Dimensions: Distilling Literary Criticism into Scorable Metrics

This is the core contribution, with dimensions split into three layers. Basic Instruction Following (2 dims): form accuracy (adherence to meter/rhyme) and theme alignment. Advanced Creative Abilities (6 dims): creativity (novelty), lexical diversity (vocabulary richness), idiosyncrasy (reflection of authorial voice), emotional resonance (evocation of feeling), use of literary devices (simile, metaphor, personification, allusion), and use of imagery (sensory and vivid depictions). Global Evaluation (2 dims): overall poem quality and authorship estimation (judgment of human vs. AI origin). These 6 advanced dimensions are distilled from the focal points of critics in "Practical Criticism," filling the gap between "valid sentences" and "good poetry."
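The three-layer taxonomy can be written down as a simple lookup structure. This is only an organizational sketch: the identifier names and the `layer_of` helper are hypothetical, and the `RULE_BASED` set reflects the dimensions that the summary says also receive a rule-based measurement.

```python
# Layer -> dimensions, following the grouping described above.
DIMENSIONS = {
    "basic_instruction_following": ["form_accuracy", "theme_alignment"],
    "advanced_creative_abilities": [
        "creativity",
        "lexical_diversity",
        "idiosyncrasy",
        "emotional_resonance",
        "literary_devices",
        "imagery",
    ],
    "global_evaluation": ["overall_quality", "authorship_estimation"],
}

# Dimensions that additionally have a rule-based metric
# (meter/rhyme detection, MATTR, repetition rate).
RULE_BASED = {"form_accuracy", "lexical_diversity", "creativity"}


def layer_of(dimension: str) -> str:
    """Return the layer a dimension belongs to (hypothetical helper)."""
    for layer, dims in DIMENSIONS.items():
        if dimension in dims:
            return layer
    raise KeyError(dimension)


all_dimensions = [d for dims in DIMENSIONS.values() for d in dims]
print(len(all_dimensions))       # 10
print(layer_of("idiosyncrasy"))  # advanced_creative_abilities
```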

3. Triangulated Verification Protocol: Rules + LLM Judge + Human Expert Validation

To balance scale and credibility, the authors use three heterogeneous judgment methods.

Rule-based algorithms handle the objective dimensions: detecting meter and rhyme for form accuracy; using MATTR (Moving Average Type-Token Ratio) for lexical diversity; and quantifying creativity as the "repetition rate of LLM poems relative to human originals" (higher repetition implies lower creativity).

LLM judges perform the large-scale subjective scoring. A pilot study comparing Gemini-2.5-Pro, DeepSeek-R1, and GPT-4o found that Gemini-2.5-Pro had the highest agreement with human experts (\(\mathrm{PAo} = 0.662\) vs. \(0.548/0.438\)) and the best discriminative power for overall quality (Std. Dev. \(0.63\) vs. \(0.20/0.22\)), so it was selected as the primary judge, scoring on a 5-point Likert scale.

Human expert validation checks the judge's credibility: 7 experts with backgrounds in poetry or English literature (including professional poets and professors) anonymously evaluated a representative subset of 58 poems. Agreement was measured via Proportion Agreement Observed:

\[\mathrm{PAo} = \frac{2A}{n_A + n_B}\]

where \(A\) is the number of agreements between the two raters and \(n_A\), \(n_B\) are the total numbers of scores given by each rater. Gemini achieved a \(\mathrm{PAo}\) of \(0.662\) with humans, supplemented by a Quadratic Weighted Kappa of \(\kappa=0.361\) and a Spearman correlation of \(\rho=0.378\). The authors consider this comparable to previously reported LLM-human agreement (\(\rho \approx 0.41\) to \(0.42\)) and take it as support for the reliability of the automated evaluation.
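The agreement statistics can be illustrated with a short pure-Python sketch. Two assumptions are made here and are not confirmed by the source: an "agreement" is counted as an exact match between the two scores for the same poem, and scores are integers on the 1 to 5 Likert scale. The sample score lists are invented for illustration.

```python
def proportion_agreement(scores_a, scores_b):
    """PAo = 2A / (n_A + n_B); assumes agreement means an exact score match."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both raters must score the same poems")
    agreements = sum(a == b for a, b in zip(scores_a, scores_b))
    return 2 * agreements / (len(scores_a) + len(scores_b))


def quadratic_weighted_kappa(scores_a, scores_b, min_score=1, max_score=5):
    """Cohen's kappa with quadratic disagreement weights for integer ratings."""
    k = max_score - min_score + 1
    n = len(scores_a)
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(scores_a, scores_b):
        observed[a - min_score][b - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use a single score")
    return 1 - weighted_observed / weighted_expected


# Invented example scores for five poems.
judge = [4, 3, 5, 2, 4]
expert = [4, 4, 5, 2, 3]
print(proportion_agreement(judge, expert))  # 0.6
print(round(quadratic_weighted_kappa(judge, expert), 3))
```

When both raters score the same set of poems, \(n_A = n_B\) and PAo reduces to the plain fraction of matching scores. PAo ignores how far apart two disagreeing scores are, which is why a distance-sensitive statistic such as the quadratic weighted kappa is a useful complement.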

Key Experimental Results

Main Results

Poems were generated by 30 LLMs from 7 companies. With Gemini-2.5-Pro as the judge (maximum score 5.00), representative results are:

Dimension	Human	Best LLM	Note
Form Accuracy	—	Gemini-2.5-Pro 4.26	Top LLMs have high formal accuracy; Llama-3.3-70B is only 2.29
Theme Alignment	—	Gemini-2.5-Pro 4.99	Theme alignment is generally near-perfect
Creativity	4.02	DeepSeek-R1 3.31	Humans lead significantly
Idiosyncrasy	3.95	DeepSeek-R1 2.17	LLMs are weakest here; largest gap
Emotional Resonance	4.06	DeepSeek-R1 3.53	Humans lead
Imagery	4.49	DeepSeek-R1 4.30	Humans lead
Literary Devices	4.67	DeepSeek-R1 4.38	Humans lead
Lexical Diversity	3.82	DeepSeek-R1 3.85	The only dimension where LLMs surpass humans
Overall Quality	4.22	DeepSeek-R1 3.20	Human victory by a wide margin

Core conclusion: top LLMs score highly on "Basic Instruction Following" (up to 4.26 for form and 4.99 for theme), but LLMs trail humans on every advanced creative dimension except lexical diversity. In overall quality, humans (4.22) clearly outperform the best LLM (3.20).
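The size of each human-versus-LLM gap follows directly from the table. The short calculation below simply subtracts the best LLM score from the human score for each dimension that has a human baseline; the numbers are copied from the table and the variable names are illustrative.

```python
# (human score, best LLM score) per dimension, copied from the table above.
scores = {
    "Creativity": (4.02, 3.31),
    "Idiosyncrasy": (3.95, 2.17),
    "Emotional Resonance": (4.06, 3.53),
    "Imagery": (4.49, 4.30),
    "Literary Devices": (4.67, 4.38),
    "Lexical Diversity": (3.82, 3.85),
    "Overall Quality": (4.22, 3.20),
}

# Positive gap = humans ahead; negative gap = best LLM ahead.
gaps = {dim: round(human - llm, 2) for dim, (human, llm) in scores.items()}
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)

for dim, gap in ranked:
    print(f"{dim:20s} {gap:+.2f}")
```

Ranked this way, idiosyncrasy shows the widest gap (1.78 points), followed by overall quality (1.02) and creativity (0.71). Imagery and literary devices are much closer (0.19 and 0.29), and lexical diversity is the one dimension where the best LLM is marginally ahead (by 0.03).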

Rule-based Metrics and Scaling Analysis

Phenomenon	Data	Meaning
Rule Form Accuracy	Gemini-2.5-Pro 0.50, Claude-3.7 0.47	Algorithmic detection distinguishes formal ability
MATTR	LLM > Human	LLMs exhibit higher lexical diversity
Repetition Rate	LLM >> Human	LLMs show signs of "mimicry" of originals
Parameter Scale	Larger is better within families	"Reasoning" models aren't necessarily better (GPT-4o > o1/o3-mini)
Distilled Models	Generally weaker than originals	Exception: Distill-Llama-3.3-70B outperformed its base
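For readers unfamiliar with MATTR, the sketch below shows the standard moving-average type-token ratio: the type-token ratio is computed in every window of fixed length and the results are averaged, which makes the score less sensitive to text length than a plain type-token ratio. The window size of 50 and the simple regex tokenizer are assumptions; the summary does not state which settings the authors used.

```python
import re


def mattr(text: str, window: int = 50) -> float:
    """Moving-Average Type-Token Ratio over fixed-size token windows.

    The window size and tokenizer are illustrative assumptions.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Text shorter than one window: fall back to the plain type-token ratio.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


print(mattr("the the the the", window=2))  # 0.5
print(mattr("a b c d", window=2))          # 1.0
```

A poem that keeps reusing the same words scores close to 0, while one in which every window contains only distinct words scores 1.0. Since many fixed forms (villanelles, pantoums) require repeated lines, a lower MATTR in human poems does not by itself indicate weaker writing.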

Key Findings

Idiosyncrasy is the largest gap: LLMs lack personal uniqueness and lived experience. The best LLM scored 2.17 compared to 3.95 for humans, suggesting "individuality" is the hardest trait for models to acquire.
Form is Easy, Soul is Hard: Top LLMs score highly on form (up to 4.26) and near-perfectly on theme (4.99) but fail to produce creative, emotional, or personally stamped poems, confirming that "valid sentences \(\neq\) good poetry."
Judges Recognize Authorship: Without being told the author, Gemini-2.5-Pro identified 80 of the 203 human poems (39.4%) through memorization or style recognition. Human experts recognized fewer specific poems but almost always correctly judged them to be human-written.
Reasoning \(\neq\) Poetic Skill: Enhanced reasoning does not necessarily translate to better poetry. GPT-4o outscored o1 and o3-mini, and the DeepSeek-R1-Distill models were mostly worse than their original counterparts, suggesting that creativity and reasoning are distinct capabilities.

Highlights & Insights

Repetition Rate as a Proxy for Creativity: Quantifying creativity as the "repetition rate of LLM poems relative to human originals" uses a computable objective value to approximate a highly subjective concept. Higher repetition indicates imitation; this metric is cheap, reproducible, and transferable.
Triangulated Verification: Combining rules, an LLM judge, and human experts covers objectivity, scale, and credibility respectively. Quantifying agreement with PAo, Kappa, and Spearman offers a template for making "subjective art evaluation" credible.
Fixed-form as a Scaffold: Establishing a quantifiable baseline with strict constraints before verifying subjective dimensions is a methodological strategy that could extend to free verse or other creative writing (novels, essays).
The "Last Stanza" Metaphor: The title "The Last Stanza of Humanity" echoes the finding that idiosyncrasy and emotional resonance—what makes us human—are the most difficult poetic barriers for current LLMs to cross.
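The repetition-rate proxy mentioned in the first highlight can be approximated with n-gram overlap. The summary does not give the exact definition used in the paper, so the sketch below is an assumption: it measures the share of word trigrams in a generated poem that also occur in the human original, with whitespace tokenization and `n=3` chosen only for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repetition_rate(generated: str, original: str, n: int = 3) -> float:
    """Share of the generated poem's n-grams that also appear in the original.

    An assumed definition for illustration; not necessarily the paper's formula.
    """
    generated_ngrams = ngrams(generated.lower().split(), n)
    if not generated_ngrams:
        return 0.0
    original_ngrams = set(ngrams(original.lower().split(), n))
    repeated = sum(gram in original_ngrams for gram in generated_ngrams)
    return repeated / len(generated_ngrams)


original = "shall i compare thee to a summer's day"
generated = "shall i compare thee to a winter's night"
print(round(repetition_rate(generated, original), 3))  # 0.667
```

In this toy example four of the six generated trigrams are copied from the original line, giving a rate of about 0.667. A metric of this kind is cheap and reproducible, but it only detects surface copying: a poem can avoid every n-gram of the original and still be conventional in its ideas.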

Limitations & Future Work

English Only: The authors acknowledge that the study examines only English poetry. They claim POEMetric also applies to low-resource languages, but this was not empirically tested.
Fixed-form Limitation: Free verse, the mainstream of modern poetry, was excluded. It remains unknown if LLM formal advantages in fixed forms translate to unconstrained writing.
Small Human Sample: Experts only evaluated 58 poems (some analyses used only 13), limiting statistical power. The "human gold standard" for creative dimensions remains inherently subjective.
LLM Judge Bias: Although Gemini-2.5-Pro was the most consistent judge in the pilot, relying on a single LLM can introduce systematic stylistic preferences. Furthermore, the judge "recognized" about 39% of the originals (80 of 203), suggesting that quality scores might be contaminated by training data leakage.
Improvements: Extending to multiple languages and free verse, using an ensemble of LLM judges to reduce bias, and de-memorizing inputs (e.g., surface paraphrasing) would enhance benchmark fairness.

Comparison with Related Work

Vs. Form-oriented Generation (ByGPT5 / PoeLM / GPoet): These systems focus on building formal constraints (meter and rhyme) into generation. POEMetric instead builds an evaluation framework focused on quality, filling the gap in advanced creative assessment.
Vs. ProFTAP (Turing-style Evaluation): ProFTAP uses "indistinguishability from humans" as a single criterion. POEMetric decomposes this into 10 interpretable dimensions, identifying exactly where LLMs fall short.
Vs. Yu et al.'s LLM-as-a-judge for Poetry: Yu et al. use general dimensions (fluency, relevance). POEMetric introduces poetry-specific dimensions such as idiosyncrasy and imagery, and adds rule-based and expert verification for higher credibility.

Rating

Novelty: ⭐⭐⭐⭐⭐ First to distill literary criticism into 10 quantifiable dimensions with triangulated verification.
Experimental Thoroughness: ⭐⭐⭐⭐ Large scale, with 6,090 poems across 30 models, though the human-evaluated subset is small and the study is limited to English fixed-form poetry.
Writing Quality: ⭐⭐⭐⭐⭐ Solid motivation, natural integration of literary theory and quantitative metrics, and clear, thematic conclusions.
Value: ⭐⭐⭐⭐⭐ Provides a reproducible yardstick and diagnostic breakdown for "how far LLM creativity is from humanity."