SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts
- Conference: CVPR 2026
- arXiv: 2511.22490
- Code: https://omron-sinicx.github.io/paper2layout/
- Area: Multimodal VLM / Document Understanding
- Keywords: Poster layout generation, scientific papers, retrieval-augmented generation, contrastive learning, document layout analysis
TL;DR
This work introduces SciPostGen, a large-scale dataset of 18,097 paper–poster pairs. Analysis reveals a moderate correlation between paper structure and the number of poster layout elements. A retrieval-augmented poster layout generation framework is proposed, which leverages contrastive learning to retrieve layout templates matching the input paper and guides an LLM to generate the final poster layout.
Background & Motivation
The volume of scientific publications continues to grow (monthly arXiv submissions increased from ~8,000 in 2015 to over 20,000 in 2025), and posters serve as an important medium for efficiently communicating research findings. Automatically generating posters from papers requires addressing two sub-problems: content summarization (what to include) and layout generation (how to arrange). Existing work predominantly focuses on content summarization, with layouts either relying on fixed templates or rule-based generation derived from paper structure. However, layout design significantly affects the effectiveness of information communication, making it worthwhile to learn the paper-to-layout mapping in a data-driven manner.
The core bottleneck is the lack of large-scale paired datasets. Existing poster generation datasets contain only a few hundred paper–poster pairs, too few for data-driven approaches. SciPostGen scales this to 18,097 pairs through a combination of automatic annotation and manual correction, and provides fine-grained annotations for both papers (OCR text, figure bounding boxes) and posters (8 categories of layout elements).
Analysis reveals exploitable correlations between paper structure and poster layout: papers with more text tend to have fewer figure elements in their posters (Spearman \(\rho < -0.40\)), while the number of paper figures correlates positively with poster figure elements. This motivates a retrieval-augmented layout generation strategy—retrieving poster layouts from structurally similar papers as generation references.
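To make the correlation analysis concrete, here is a minimal sketch of Spearman's rank correlation in plain Python. The function names are hypothetical and the toy numbers are invented for illustration; they are not SciPostGen statistics.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (invented): word count of each paper vs. figure elements on its poster.
paper_word_counts = [3000, 4500, 5200, 6100, 7800]
poster_figure_counts = [9, 7, 8, 5, 4]
rho = spearman(paper_word_counts, poster_figure_counts)
print(f"Spearman rho = {rho:.2f}")
```

A negative value on such data would indicate the same direction of effect as reported above: longer papers tend to come with fewer figure elements on the poster.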
Method
Overall Architecture
The system consists of two modules: (1) a layout retriever, consisting of a paper encoder and a layout encoder trained with contrastive learning to map paper page images and poster layout images into a shared embedding space, which retrieves the top-3 most similar layouts at inference time; and (2) a layout generator based on Llama-3.1-8B-Instruct, which takes the retrieved layouts and paper structural information as input and outputs the final layout (category + normalized bounding boxes). Both fully automatic and semi-automatic modes are supported.
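The retrieval step can be sketched as a cosine-similarity ranking over precomputed layout embeddings. This is a minimal illustration under stated assumptions: the embeddings are plain Python lists, the layout identifiers and 2-D vectors are made up, and `retrieve_top_k` is a hypothetical helper rather than a function from the released code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(paper_embedding, layout_embeddings, k=3):
    """Return the ids of the k layouts most similar to the paper embedding."""
    scored = [
        (cosine_similarity(paper_embedding, emb), layout_id)
        for layout_id, emb in layout_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [layout_id for _, layout_id in scored[:k]]


# Toy layout bank (invented 2-D embeddings standing in for encoder outputs).
layout_bank = {
    "poster_a": [1.0, 0.0],
    "poster_b": [0.9, 0.1],
    "poster_c": [0.0, 1.0],
    "poster_d": [-1.0, 0.0],
    "poster_e": [0.5, 0.5],
}
top3 = retrieve_top_k([1.0, 0.05], layout_bank, k=3)
print(top3)
```

In a real system the layout bank would hold one embedding per training-set poster, and the retrieved layouts would then be passed to the generator as references.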
Key Designs
- Contrastive Learning Layout Retriever:
    - Function: Retrieve poster layouts that match the structural characteristics of a given paper.
    - Mechanism: The paper encoder renders a multi-page PDF as a sequence of page images. Each page is processed by DiT (Document Image Transformer) to extract patch features, which are aggregated into a paper embedding \(x^p\) by two-level attention pooling (first within each page, then across pages). The layout encoder processes rendered layout images analogously to obtain \(x^l\). The model is trained with an InfoNCE contrastive loss that treats paired paper–layout instances as positives and all other in-batch pairings as negatives. At inference, the top-3 layouts are retrieved from the training set by cosine similarity.
    - Design Motivation: Moderate correlations exist between paper structure (text volume, figure count) and poster layout, and image-based encoding can implicitly capture these structural features. Retrieving multiple layouts rather than a single template accommodates the diversity of poster designs.
- LLM Layout Generator:
    - Function: Integrate retrieval results and paper structural constraints to generate the final layout.
    - Mechanism: Paper structure (section count, figure and table counts, and their aspect ratios) and the retrieved layouts are serialized as text sequences and fed to the LLM, which is prompted to generate a layout sequence \(L = \{(c_i, b_i)\}\), where \(c_i\) is the element category and \(b_i\) its normalized bounding box. In semi-automatic mode, user-specified partial layout constraints (e.g., the two largest elements, pre-placed) are additionally provided, and the model completes the remaining elements within those constraints.
    - Design Motivation: LLMs flexibly integrate heterogeneous inputs; compared with dedicated layout models based on GANs, Transformers, or diffusion models, they more naturally incorporate retrieval results, paper structure, and user constraints.
- Semi-Automatic Constraint Mechanism:
    - Function: Simulate real-world workflows where a creator places primary elements and the system completes the remaining layout.
    - Mechanism: The two largest elements (by area) of the gold layout are provided as constraints, and the system generates the remaining elements. This mimics a collaborative workflow in which a human sets the major structure and the model fills in the details.
    - Design Motivation: Fully automatic generation struggles to satisfy personalized requirements; the semi-automatic mode strikes a balance between practicality and automation.
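The semi-automatic setup described above can be illustrated with a short sketch: pick the two largest elements of a layout by area as fixed constraints, then serialize them as text for the generator prompt. Everything here is an assumption for illustration, including the `(x, y, w, h)` box convention, the category names, and the one-element-per-line text format; the source does not specify the actual serialization.

```python
def box_area(box):
    """Area of a normalized (x, y, w, h) bounding box."""
    _, _, w, h = box
    return w * h


def split_constraints(layout, num_fixed=2):
    """Split a layout into the num_fixed largest elements and the rest."""
    order = sorted(
        range(len(layout)), key=lambda i: box_area(layout[i][1]), reverse=True
    )
    fixed_indices = set(order[:num_fixed])
    fixed = [layout[i] for i in order[:num_fixed]]
    remaining = [el for i, el in enumerate(layout) if i not in fixed_indices]
    return fixed, remaining


def serialize_layout(layout):
    """Render each element as 'category x y w h' on its own line (assumed format)."""
    return "\n".join(
        f"{category} {x:.2f} {y:.2f} {w:.2f} {h:.2f}"
        for category, (x, y, w, h) in layout
    )


# Toy gold layout (invented): (category, (x, y, w, h)) in normalized coordinates.
gold_layout = [
    ("title", (0.05, 0.02, 0.90, 0.10)),
    ("figure", (0.05, 0.15, 0.40, 0.35)),
    ("text", (0.50, 0.15, 0.45, 0.50)),
    ("table", (0.05, 0.55, 0.40, 0.20)),
]
fixed, remaining = split_constraints(gold_layout, num_fixed=2)
constraint_text = serialize_layout(fixed)
print(constraint_text)
```

During evaluation, the fixed elements would be given to the model as constraints and the remaining elements would serve as the targets it has to complete.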
Loss & Training
The retriever is trained with an InfoNCE contrastive loss: \(\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N}\exp(s_{ij})}\), where \(N\) is the batch size and \(s_{ij}\) denotes the cosine similarity between the embedding of paper \(i\) and that of layout \(j\), so the diagonal terms \(s_{ii}\) correspond to the matched pairs. The generator is fine-tuned from Llama-3.1-8B-Instruct using silver layout annotations from the SciPostGen training set.
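As a sanity check on the formula, the loss can be computed directly from an \(N \times N\) similarity matrix. The sketch below follows the expression exactly as written here (paper-to-layout direction only, no temperature term); whether the actual training code adds a temperature or a symmetric term is not stated in this summary. The function name `info_nce` is illustrative.

```python
import math


def info_nce(similarities):
    """InfoNCE over a square matrix where similarities[i][j] = s_ij.

    Row i holds the similarities of paper i to every layout in the batch;
    the diagonal entry s_ii is the matched (positive) pair.
    """
    n = len(similarities)
    total_log_prob = 0.0
    for i, row in enumerate(similarities):
        # Log-sum-exp with max subtraction for numerical stability.
        row_max = max(row)
        log_denominator = row_max + math.log(
            sum(math.exp(s - row_max) for s in row)
        )
        total_log_prob += row[i] - log_denominator
    return -total_log_prob / n


# Uninformative similarities: every layout looks equally good, loss = log(N).
uniform_loss = info_nce([[0.0, 0.0], [0.0, 0.0]])

# Well-aligned similarities: matched pairs score higher, so the loss drops.
aligned_loss = info_nce([[1.0, -1.0], [-1.0, 1.0]])
print(uniform_loss, aligned_loss)
```

With uninformative similarities the loss equals \(\log N\), and it decreases as matched pairs are scored above the in-batch negatives.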
Key Experimental Results
Main Results
Layout Retrieval Performance (Paper → Layout Retrieval)
| Method | Recall@1 (%) | Recall@3 (%) | Recall@5 (%) |
|---|---|---|---|
| Random | 0.05 | 0.15 | 0.25 |
| Paper encoder only | 4.83 | 12.12 | 18.11 |
| Full retriever | 8.20 | 19.87 | 28.37 |
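For reference, Recall@k in the table above can be read as the percentage of query papers whose paired layout appears among the top k retrieved candidates. The sketch below assumes exactly one correct layout per paper; the function names and toy rankings are illustrative and not taken from the paper's evaluation code. It also shows that a uniformly random ranking over N candidates gives 100·k/N percent, so the Random row would be consistent with a pool of roughly 2,000 candidates, although the pool size is not stated here.

```python
def recall_at_k(ranked_ids_per_query, gold_ids, k):
    """Percentage of queries whose gold id appears in the top-k ranking."""
    hits = sum(
        1
        for ranked, gold in zip(ranked_ids_per_query, gold_ids)
        if gold in ranked[:k]
    )
    return 100.0 * hits / len(gold_ids)


def random_recall_at_k(k, pool_size):
    """Expected Recall@k (%) of a uniformly random ranking over pool_size items."""
    return 100.0 * k / pool_size


# Toy rankings (invented): each inner list is one query's retrieved ids, best first.
rankings = [
    ["a", "b", "c"],
    ["c", "a", "b"],
    ["b", "c", "a"],
    ["a", "c", "b"],
]
gold = ["a", "a", "a", "b"]

r1 = recall_at_k(rankings, gold, k=1)
r2 = recall_at_k(rankings, gold, k=2)
r3 = recall_at_k(rankings, gold, k=3)
print(r1, r2, r3)
print(random_recall_at_k(1, 2000))
```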
Layout Generation Quality (FID / mIoU / Overlap)
| Configuration | FID ↓ | mIoU ↑ | Overlap ↓ |
|---|---|---|---|
| Without retrieval | baseline | baseline | baseline |
| + Retrieval augmentation | improved | improved | reduced |
| + Retrieval + constraints (semi-automatic) | best | best | lowest |
Ablation Study
| Configuration | Retrieval Recall@3 (%) | Generation FID ↓ | Notes |
|---|---|---|---|
| Image encoding only (DiT) | 19.87 | — | Base retrieval performance |
| Layout annotation encoding only | lower | — | Image encoding outperforms structured annotations |
| Direct generation without retrieval | — | higher | Poor quality without reference layouts |
| Top-1 retrieval | — | moderate | Insufficient diversity with single template |
| Top-3 retrieval | — | lowest | Multiple templates provide better guidance |
Key Findings
- The Spearman correlation between paper structure and poster layout is moderate (\(|\rho| \approx 0.40\text{–}0.50\)), indicating that structural information is useful but insufficient to fully determine the layout.
- Image-based encoding outperforms directly using layout annotations as input—images implicitly preserve spatial relationships.
- In semi-automatic mode, the addition of constraints substantially improves consistency between generated and ground-truth layouts.
- The mAP@0.50:0.95 between silver (automatically annotated) and gold (manually corrected) layouts is 0.53, indicating moderate agreement.
Highlights & Insights
- The dataset construction methodology is instructive: automatic annotation (Azure Document Intelligence + Nougat OCR) combined with manual correction for validation/test sets balances scale and quality—a practical strategy under limited annotation resources.
- The research problem of "paper structure → poster layout" is itself novel: prior work focuses on content summarization; this paper is the first to systematically study the structure-to-layout mapping and quantitatively analyze their correlations.
- Retrieval-augmented strategy: using retrieved layouts from structurally similar papers as "reference designs" to guide generation is more controllable and diverse than generating from scratch.
Limitations & Future Work
- Only layout bounding boxes are generated, not actual poster content (text, images), leaving a substantial gap to end-to-end poster generation.
- The dataset is limited to computer science conferences (CVPR/ICLR/ICML/NeurIPS); poster styles in other disciplines may differ.
- Retrieval Recall@1 is only 8.2%, suggesting that the paper-to-layout mapping remains weak and may require richer paper representations.
- The subjective quality of generated layouts (e.g., readability, aesthetics) is not evaluated; only numerical metrics are used.
Related Work & Insights
- vs. PosterLayout [56]: Provides fine-grained layout annotations but only hundreds of pairs; SciPostGen is 30× larger in scale.
- vs. rule-based layout: Methods such as [42] derive layouts from predefined rules based on paper structure, lacking flexibility and diversity.
- vs. general layout generation: Advertisement/webpage layout generation methods cannot leverage paper structural information; this work introduces the paper as a conditioning signal.
Rating
- Novelty: ⭐⭐⭐⭐ The research problem (paper → poster layout) is novel and the dataset is valuable, though the method itself is a standard retrieval-augmented + LLM combination.
- Experimental Thoroughness: ⭐⭐⭐ Lacks user studies and subjective evaluation; quantitative metrics for retrieval and generation are not comprehensive enough.
- Writing Quality: ⭐⭐⭐⭐ The dataset construction and analysis sections are clear; the overall structure is well-organized.
- Value: ⭐⭐⭐⭐ The dataset is a valuable contribution to the community, and the framework lays a foundation for automated academic poster generation.