BIPro: Zero-shot Chinese Poem Generation via Block Inverse Prompting Constrained Generation Framework¶
Conference: ACL 2025
arXiv: 2411.13237
Code: None
Area: LLM/NLP
Keywords: Constrained generation, Classical Chinese poetry, Inverse prompting, Block generative model, GLM
TL;DR¶
This paper proposes the BIPro framework, which leverages the infilling capability of Block Generative Models. Through two block inverse prompting methods, "revise" and "rewrite", it enables the weaker GLM-10B model to outperform GPT-4 and top-performing domain-specific systems in open-ended classical Chinese poetry generation without the need for domain-specific training.
Background & Motivation¶
Constrained writing requires text to satisfy specific constraints (such as rhyme and meter) and serves as an important technique to enhance aesthetic value in literary creation. Poetry is the most well-known application of constrained writing.
Limitations of Prior Work:
- Direct generative models (such as GPT-4) generate text autoregressively, token by token. Since they only consider the prefix and cannot modify previously generated content, they struggle with constrained writing.
- Although poems generated by GPT are almost indistinguishable from human works to average readers, professional poets identify a significant gap.
- Domain-specific systems (such as Yusheng, Shisanbai) rely on massive domain-specific training data and lack generalization.
Limitations of Inverse Prompting: It scores generation quality by calculating the perplexity of the generated text under an inverse format. However, it depends on a precise natural-language inverse formulation of the prompt, which in many cases does not exist or only approximates the original meaning.
Unique Advantages of Block Generative Models: The GLM series models can generate middle text based on both prefix and suffix (non-monotonic generation). This makes "modifying generated content" possible, mirroring the human writing process more closely.
Method¶
Overall Architecture¶
The BIPro framework divides poetry generation into three stages:
- Initial Generation: Generates the poem sentence-by-sentence, with each sentence satisfying Pingshui Rhyme constraints.
- Revise: Immediately after generating each sentence, the block generative model is used to regenerate the preceding sentence. The old sentence is replaced if the new version achieves a higher BIPro score.
- Rewrite: After the entire poem is generated, each sentence is masked and regenerated sequentially. This iterative process repeats for multiple rounds until no further improvement is achieved or the maximum round limit is reached.
This simulates the human poetic composition process: deliberate thinking, revision, and proofreading.
Key Designs¶
Block Inverse Prompting:
- Traditional inverse prompting requires converting the natural-language prompt into an inverse form (e.g., "Write a poem about X" \(\rightarrow\) "The title of this poem should be X"). This conversion is hard to write and can distort the semantics.
- BIPro uses the block generative model to mask the prompt text directly and takes the perplexity of reconstructing the prompt as the scoring metric.
- This avoids semantic loss from inverse-form conversion and enables evaluation of sentences in intermediate positions (which traditional inverse prompting cannot do).
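A minimal sketch of how a block-inverse pair might be assembled: the topic inside the original prompt is masked, the generated text stays visible, and the masked span becomes the reconstruction target. The `[MASK]` string, the `{topic}` template slot, and the names `BlockInversePair` and `to_block_inverse` are illustrative assumptions, not the paper's exact format.

```python
from typing import NamedTuple

MASK = "[MASK]"  # assumed placeholder; the real mask token depends on the model


class BlockInversePair(NamedTuple):
    context: str  # text shown to the block generative model, prompt span masked
    target: str   # the masked span the model is asked to reconstruct


def to_block_inverse(template: str, topic: str, generated: str) -> BlockInversePair:
    """Mask the topic in the original prompt and keep the generated text visible."""
    if "{topic}" not in template:
        raise ValueError("template needs a {topic} slot")
    masked_prompt = template.replace("{topic}", MASK)
    return BlockInversePair(context=f"{masked_prompt}\n{generated}", target=topic)


# Hypothetical prompt and placeholder poem text, for illustration only.
pair = to_block_inverse("Write a poem about {topic}:", "autumn moon", "<generated poem text>")
```

Because the prompt keeps its original wording and only one span is hidden, no hand-written inverse sentence is needed.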
Beam-based Constrained Generation Strategy:
- Multiple beams are maintained. Beams violating Pingshui Rhyme constraints are eliminated and replaced by alternative valid generations from other active beams.
- Maintaining a candidate pool helps overcome "dead-end" states under tight constraints.
- Finally, all candidate beams are evaluated using the BIPro scorer to select the optimal one.
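The elimination-and-refill idea can be sketched as a single beam step. This is a toy under stated assumptions: `propose` stands in for the model's ranked continuations and `is_valid` for the rhyme and tone check, both hypothetical callables, and the round-robin refill order is one plausible choice rather than the paper's documented policy.

```python
from typing import Callable, List


def constrained_beam_step(
    beams: List[str],
    propose: Callable[[str], List[str]],  # hypothetical: ranked continuations of a beam
    is_valid: Callable[[str], bool],      # hypothetical: stand-in for the constraint check
    beam_size: int,
) -> List[str]:
    """Extend every beam, drop invalid continuations, refill from surviving beams."""
    valid = [[c for c in propose(b) if is_valid(c)] for b in beams]
    next_beams: List[str] = []
    rank = 0
    # Round-robin by rank: every beam first contributes its best valid continuation;
    # slots left empty by dead-end beams are filled with the next-best continuations
    # of beams that still have valid options.
    while len(next_beams) < beam_size and any(rank < len(v) for v in valid):
        for v in valid:
            if rank < len(v) and len(next_beams) < beam_size:
                next_beams.append(v[rank])
        rank += 1
    return next_beams


def propose(prefix: str) -> List[str]:
    # Toy model: always proposes "a", "b", "c" in that order.
    return [prefix + ch for ch in "abc"]


def is_valid(text: str) -> bool:
    # Toy constraint: must not start with "b" and must not repeat adjacent characters.
    return not text.startswith("b") and all(x != y for x, y in zip(text, text[1:]))


survivors = constrained_beam_step(["a", "b"], propose, is_valid, beam_size=2)
```

In this toy run the beam `"b"` is a dead end, so both surviving slots are filled from the beam `"a"`, giving `["ab", "ac"]`.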
BIPro Scorer:
- Converts the prompt and generated text into a BIPro-formatted prompt and target.
- The BIPro prompt is fed into the block generative model, and the perplexity of the target text serves as the BIPro score.
- A lower score (lower perplexity) indicates better generation quality.
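As a hedged illustration of the scoring rule, the sketch below computes perplexity with the standard definition (the exponential of the negative mean log-probability of the target tokens) and picks the candidate with the lowest value. The per-token log-probabilities are made-up numbers; in practice they would come from the block generative model, and `pick_best` is a hypothetical helper name.

```python
import math
from typing import Mapping, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability of the target tokens."""
    if not token_logprobs:
        raise ValueError("need at least one target token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_best(candidates: Mapping[str, Sequence[float]]) -> str:
    """Return the candidate whose target log-probs give the lowest perplexity."""
    return min(candidates, key=lambda name: perplexity(candidates[name]))


# Made-up log-probabilities of the masked prompt tokens under two candidate poems.
scores = {
    "candidate_a": [math.log(0.5), math.log(0.5)],    # perplexity 2
    "candidate_b": [math.log(0.25), math.log(0.25)],  # perplexity 4
}
best = pick_best(scores)
```

Here `candidate_a` is selected, because the model finds the original prompt easier to reconstruct from it.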
Algorithms for Revise and Rewrite (Algorithm 1):
- Revise: After generating the \(k\)-th sentence, the \((k-1)\)-th sentence is masked and regenerated. It is replaced if the new version has a better BIPro score.
- Rewrite: Once the full poem is complete, each sentence is masked and regenerated sequentially. This iterates for multiple rounds (up to \(m\) rounds) until convergence.
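The control flow of revise and rewrite can be sketched as follows. This is not the paper's Algorithm 1 verbatim: `first_pass`, `infill`, and `score` are hypothetical callables standing in for constrained generation, block infilling, and the BIPro scorer (lower is better), and the toy functions at the bottom exist only so the sketch runs.

```python
from typing import Callable, List

FirstPass = Callable[[List[str]], str]    # hypothetical: write the next line
Infill = Callable[[List[str], int], str]  # hypothetical: regenerate line idx in context
Score = Callable[[List[str]], float]      # hypothetical: BIPro-style score, lower is better


def try_replace(lines: List[str], idx: int, infill: Infill, score: Score) -> bool:
    """Regenerate line idx and keep the new line only if the score improves."""
    candidate = lines[:idx] + [infill(lines, idx)] + lines[idx + 1:]
    if score(candidate) < score(lines):
        lines[idx] = candidate[idx]
        return True
    return False


def generate(n_lines: int, first_pass: FirstPass, infill: Infill,
             score: Score, max_rounds: int) -> List[str]:
    lines: List[str] = []
    for k in range(n_lines):
        lines.append(first_pass(lines))               # initial generation of line k
        if k >= 1:
            try_replace(lines, k - 1, infill, score)  # revise: line k-1 now has a suffix
    for _ in range(max_rounds):                       # rewrite: sweep over every line
        improved = False
        for i in range(len(lines)):
            improved = try_replace(lines, i, infill, score) or improved
        if not improved:                              # converged before the round limit
            break
    return lines


# Toy stand-ins: each "line" is a digit string and the score is the sum of the digits.
toy_first = lambda lines: "9"
toy_infill = lambda lines, idx: str(max(int(lines[idx]) - 4, 0))
toy_score = lambda lines: float(sum(int(x) for x in lines))

poem = generate(3, toy_first, toy_infill, toy_score, max_rounds=5)
```

The acceptance test in `try_replace` is what makes the loop monotone: a regenerated line can only replace the old one when the overall score improves, so extra rounds never make the result worse under the scorer.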
Loss & Training¶
BIPro is a zero-shot method that requires no extra training:
- The base model is the pre-trained GLM-10B-Chinese.
- No fine-tuning on domain-specific data is required.
- All improvements stem from inference-time search and scoring strategies.
- The computational cost is approximately \(O(mk)\) times that of direct generation (where \(m\) is the number of rewriting rounds and \(k\) here denotes the beam size, not a sentence index).
Key Experimental Results¶
Main Results¶
Open-ended Poetry Generation Challenge (42 topics, 6 systems, evaluated by professional poets):
| System | Format (1-5) | Informativeness (1-5) | Relevance (1-5) | Aesthetics (1-5) | Total Score (1-10) | AR (1-10) |
|---|---|---|---|---|---|---|
| Yusheng | 3.43 | 3.24 | 2.40 | 3.08 | 4.62 | 4.66 |
| Shisanbai | 3.68 | 3.34 | 2.94 | 3.01 | 5.13 | 5.16 |
| GPT-4 | 2.50 | 3.19 | 3.71 | 2.67 | 4.79 | 4.60 |
| GLM-4 | 2.58 | 2.95 | 3.70 | 2.46 | 4.72 | 4.40 |
| Baidu Poetry Assistant | 2.66 | 3.17 | 3.73 | 2.51 | 4.76 | 4.70 |
| BIPro | 3.26 | 3.42 | 3.30 | 2.93 | 5.27 | 5.22 |
BIPro achieved the highest total score (5.27) and AR score (5.22) among the six systems, although it leads in only one individual dimension (informativeness).
Parallel Poetry Generation Challenge (87 human poems, compared with human works):
| System | Total Score (1-10) | AR (1-10) |
|---|---|---|
| GLM-10B Direct Generation | 4.65 | 4.37 |
| GPT-4 | 4.98 | 4.86 |
| BIPro | 5.54 | 5.43 |
| Human Poems (Daily Masterpieces) | 6.37 | 6.42 |
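BIPro narrows the gap between AI and human-written poetry: its total score trails the human poems by 0.83 points (5.54 vs. 6.37), compared with 1.39 points for GPT-4 and 1.72 points for GLM-10B direct generation.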
Ablation Study¶
The ablation studies in this paper are mainly demonstrated via case studies:
- GLM-10B Direct Generation vs. BIPro: Direct generation sometimes replicates existing classical poems, indicating that the generative capability of GLM-10B without BIPro is weak.
- Effect of Revise and Rewrite: The case poem "Lament over Life" was rewritten for 5 rounds, showing continuous quality improvement.
Key Findings¶
- BIPro enables the weaker GLM-10B to outperform stronger direct-generation systems (GPT-4, GLM-4) and the best domain-specific systems (Yusheng, Shisanbai).
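- By dimension, BIPro leads all systems in informativeness (3.42). In format and aesthetics it scores above GPT-4, GLM-4, and Baidu Poetry Assistant but below the domain-specific systems (Yusheng, Shisanbai). In relevance it scores above the domain-specific systems (3.30) but below direct-generation systems such as GPT-4 (3.71), which are better at topic alignment.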
- In specific cases, poems generated by BIPro even scored higher than shortlisted human poems (6.70 vs. 6.20).
- The infilling capability of block generative models is the key differentiator in improving constrained generation.
Highlights & Insights¶
- A Paradigm of Weak Models Outperforming Strong Models: Rather than relying on larger model parameters or additional training data, this approach improves generation quality through superior inference-time strategies (search + scoring + iteration).
- Simulating the Human Creative Process: The revise and rewrite mechanisms mirror the human practice of repeatedly polishing and refining a draft.
- Rediscovering the Value of Block Generative Models: While subsequent versions of the GLM series (ChatGLM, GLM-4) abandoned block generation features, this work demonstrates that this feature holds unique value in constrained generation.
- Zero-shot & Training-free: Requires no training on poetry-domain data, yet outperforms domain-specific systems in total score solely through inference-time strategies.
Limitations & Future Work¶
- High Computational Complexity: Generating one poem requires around 7,000 tokens, compared to only 50 for direct generation. This is roughly 140 times as many, in line with the \(O(mk)\)-fold overhead.
- Lack of Automated Evaluation: Poem quality evaluation relies heavily on human experts, making it difficult to evaluate at scale or iterate quickly.
- Limited Base Model Choices: Block generative models are extremely rare, currently leaving only GLM-10B and GLM-130B as available options.
- Validated Only on Classical Chinese Poetry: Has not yet been verified on other constrained writing tasks (e.g., English sonnets, couplets, lyric writing).
- Potential Misuse Risks: High-quality constrained generation could be exploited for malicious content creation.
Related Work & Insights¶
- GLM (Du et al., 2022) is the core base model; its block attention mechanism constitutes the foundation of BIPro.
- Inverse Prompting (Zou et al., 2021) is the theoretical predecessor of BIPro, and BIPro overcomes its limitations via block generative models.
- Yusheng (Ma et al., 2023) and Shisanbai are the state-of-the-art domain-specific systems.
- Pingshui Rhyme is the metrical standard for traditional Chinese poetry, originating from the 13th century.
- Open question: Can BIPro's "search-score-iterate" paradigm be applied to other creative tasks requiring iterative refinement (such as lyrics, slogans, or code generation)?
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Block inverse prompting is a highly original method; the result of a weaker model outperforming stronger ones is highly impressive.
- Experimental Thoroughness: ⭐⭐⭐⭐ — The human evaluation is rigorously designed, but there is a lack of automated metrics and exhaustive ablation studies.
- Value: ⭐⭐⭐ — Due to the scarcity of block generative models, practical application scenarios are currently narrow.
- Writing Quality: ⭐⭐⭐⭐ — The paper structure is clear, algorithm descriptions are standardized, and case studies are vivid.