
SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts

Conference: CVPR 2026
arXiv: 2511.22490
Code: https://omron-sinicx.github.io/paper2layout/
Area: Multimodal VLM / Document Understanding
Keywords: Poster layout generation, scientific papers, retrieval-augmented generation, contrastive learning, document layout analysis

TL;DR

This work introduces SciPostGen, a large-scale dataset of 18,097 paper–poster pairs. Analysis reveals a moderate correlation between paper structure and the number of poster layout elements. A retrieval-augmented poster layout generation framework is proposed, which leverages contrastive learning to retrieve layout templates matching the input paper and guides an LLM to generate the final poster layout.

Background & Motivation

The volume of scientific publications continues to grow (monthly arXiv submissions increased from ~8,000 in 2015 to over 20,000 in 2025), and posters serve as an important medium for efficiently communicating research findings. Automatically generating posters from papers requires addressing two sub-problems: content summarization (what to include) and layout generation (how to arrange). Existing work predominantly focuses on content summarization, with layouts either relying on fixed templates or rule-based generation derived from paper structure. However, layout design significantly affects the effectiveness of information communication, making it worthwhile to learn the paper-to-layout mapping in a data-driven manner.

The core bottleneck is the lack of large-scale paired datasets. Existing poster generation datasets contain only a few hundred paper–poster pairs, insufficient for data-driven approaches. SciPostGen scales this to 18,097 pairs through a combination of automatic annotation and manual correction, providing fine-grained annotations for both papers (OCR text, figure bounding boxes) and posters (8 categories of layout elements).

Analysis reveals exploitable correlations between paper structure and poster layout: papers with more text tend to have fewer figure elements in their posters (Spearman \(\rho < -0.40\)), while the number of paper figures correlates positively with poster figure elements. This motivates a retrieval-augmented layout generation strategy—retrieving poster layouts from structurally similar papers as generation references.
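
The correlation analysis above relies on Spearman's rank correlation. The following is a minimal sketch of how such a coefficient can be computed, assuming per-paper word counts and per-poster figure counts are available as plain lists. The variable names and the toy numbers are illustrative assumptions and are not taken from SciPostGen.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up toy values for illustration only (NOT dataset statistics).
paper_word_counts = [4200, 5100, 6100, 7300, 8800, 9500]
poster_figure_counts = [9, 7, 8, 5, 4, 4]
rho = spearman_rho(paper_word_counts, poster_figure_counts)
print(f"Spearman rho = {rho:.2f}")
```

A negative value here mirrors the direction of the reported trend (more paper text, fewer poster figure elements); the magnitude on toy data says nothing about the real dataset.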

Method

Overall Architecture

The system consists of two modules: (1) a layout retriever—a contrastive learning-trained paper encoder and layout encoder that map paper page images and poster layout images into a shared embedding space, retrieving the top-3 most similar layouts at inference time; and (2) a layout generator—based on Llama-3.1-8B-Instruct, which takes the retrieved layouts and paper structural information as input and outputs the final layout (category + normalized bounding boxes). Both automatic and semi-automatic modes are supported.
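
The inference-time retrieval step described above can be sketched as a cosine-similarity search over precomputed layout embeddings. This is only an illustration under assumptions: the embeddings are shown as short hand-written vectors rather than encoder outputs, and `retrieve_top_k` and `layout_bank` are hypothetical names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(paper_embedding, layout_bank, k=3):
    """Return the ids of the k layouts most similar to the paper embedding."""
    scored = [
        (cosine_similarity(paper_embedding, embedding), layout_id)
        for layout_id, embedding in layout_bank.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [layout_id for _, layout_id in scored[:k]]


# Hypothetical layout embeddings standing in for training-set layouts.
layout_bank = {
    "layout_a": [1.0, 0.0, 0.0],
    "layout_b": [0.9, 0.1, 0.0],
    "layout_c": [0.0, 1.0, 0.0],
    "layout_d": [0.5, 0.5, 0.0],
}
top3 = retrieve_top_k([1.0, 0.05, 0.0], layout_bank, k=3)
print(top3)
```

In practice the bank would hold one embedding per training-set layout, and an approximate nearest-neighbor index could replace the linear scan if the bank grows large.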

Key Designs

  1. Contrastive Learning Layout Retriever:

    • Function: Retrieve poster layouts that match the structural characteristics of a given paper.
    • Mechanism: The paper encoder renders a multi-page PDF as a sequence of page images; each page is processed by DiT (Document image Transformer) to extract patch features, which are aggregated into a paper embedding \(x^p\) via two-level attention pooling (intra-page → inter-page). The layout encoder processes rendered layout images analogously to obtain \(x^l\). The model is trained with an InfoNCE contrastive loss that treats each matched paper–layout pair as a positive and every other pairing within the batch as a negative. At inference, cosine similarity is used to retrieve the top-3 layouts from the training set.
    • Design Motivation: Moderate correlations exist between paper structure (text volume, figure count) and poster layout, and image-based encoding can implicitly capture these structural features. Retrieving multiple layouts rather than a single template accommodates the diversity of poster designs.
  2. LLM Layout Generator:

    • Function: Integrate retrieval results and paper structural constraints to generate the final layout.
    • Mechanism: Paper structure (section count, figure/table counts and aspect ratios) and retrieved layouts are serialized as text sequences and fed to the LLM, which is prompted to generate a layout sequence \(L = \{(c_i, b_i)\}\). In semi-automatic mode, user-specified partial layout constraints (e.g., two pre-placed largest elements) are additionally provided, and the model completes the remaining elements within those constraints.
    • Design Motivation: LLMs can flexibly integrate heterogeneous inputs. Compared with dedicated layout models based on GANs, Transformers, or diffusion models, they incorporate retrieval results, paper structure, and user constraints more naturally, because all three can be expressed as text in a single prompt.
  3. Semi-Automatic Constraint Mechanism:

    • Function: Simulate real-world workflows where a creator places primary elements and the system completes the remaining layout.
    • Mechanism: The two largest elements (by area) from the gold layout are provided as constraints, and the system generates the remaining elements. This simulates a "human sets the major structure, AI fills in the details" collaborative mode.
    • Design Motivation: Fully automatic generation struggles to satisfy personalized requirements; the semi-automatic mode strikes a balance between practicality and automation.
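
The generator inputs and the semi-automatic constraint described above can be sketched as follows. This is a minimal illustration, not the authors' prompt format: the `Element` class, the `(x, y, w, h)` box convention, the category names, and the prompt wording are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One layout element: a category plus a normalized box (assumed x, y, w, h)."""
    category: str
    x: float
    y: float
    w: float
    h: float

    @property
    def area(self):
        return self.w * self.h


def serialize_layout(elements):
    """Turn a layout into one text line per element."""
    return "\n".join(
        f"{e.category} {e.x:.2f} {e.y:.2f} {e.w:.2f} {e.h:.2f}" for e in elements
    )


def largest_elements(elements, n=2):
    """Pick the n largest elements by area, as in the semi-automatic setting."""
    return sorted(elements, key=lambda e: e.area, reverse=True)[:n]


def build_prompt(paper_structure, retrieved_layouts, constraints=()):
    """Assemble a text prompt from paper structure, references, and constraints."""
    lines = ["Paper structure:"]
    lines += [f"{key}: {value}" for key, value in paper_structure.items()]
    for rank, layout in enumerate(retrieved_layouts, start=1):
        lines += [f"Reference layout {rank}:", serialize_layout(layout)]
    if constraints:
        lines += ["Fixed elements (keep unchanged):", serialize_layout(constraints)]
    lines.append("Generate the poster layout, one element per line:")
    return "\n".join(lines)


# Hypothetical gold layout used only to derive the two constraint elements.
gold_layout = [
    Element("title", 0.05, 0.02, 0.90, 0.10),
    Element("figure", 0.05, 0.15, 0.40, 0.40),
    Element("text", 0.50, 0.15, 0.45, 0.60),
    Element("logo", 0.90, 0.02, 0.05, 0.05),
]
constraints = largest_elements(gold_layout, n=2)
prompt = build_prompt(
    {"sections": 6, "figures": 5, "tables": 2},
    retrieved_layouts=[gold_layout[:2]],
    constraints=constraints,
)
print(prompt)
```

Omitting `constraints` yields the fully automatic variant of the same prompt, which is one way the two modes could share a single generator.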

Loss & Training

The retriever is trained with an InfoNCE contrastive loss: \(\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N}\exp(s_{ij})}\), where \(N\) is the batch size and \(s_{ij}\) denotes the cosine similarity between the embedding of the \(i\)-th paper and the embedding of the \(j\)-th layout, so \(s_{ii}\) scores the matched pair. The generator is fine-tuned from Llama-3.1-8B-Instruct using silver layout annotations from the SciPostGen training set.
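
The loss above can be evaluated directly from a square similarity matrix. The sketch below follows the formula as written in this summary (no temperature term; many implementations divide the similarities by a temperature, which is not shown here). The function name `info_nce_loss` and the toy matrices are assumptions for illustration.

```python
import math


def info_nce_loss(similarity):
    """Mean over i of -log( exp(s_ii) / sum_j exp(s_ij) ) for a square matrix."""
    n = len(similarity)
    total = 0.0
    for i, row in enumerate(similarity):
        # Log-sum-exp with the row maximum subtracted for numerical stability.
        row_max = max(row)
        log_denominator = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_denominator - row[i]
    return total / n


# Toy similarity matrices: rows are papers, columns are layouts.
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
misaligned = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
print(info_nce_loss(aligned), info_nce_loss(misaligned))
```

The loss is lower when each paper is most similar to its own layout, which is the behavior the retriever is trained toward.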

Key Experimental Results

Main Results

Layout Retrieval Performance (Paper → Layout Retrieval)

| Method | Recall@1 (%) | Recall@3 (%) | Recall@5 (%) |
| --- | --- | --- | --- |
| Random | 0.05 | 0.15 | 0.25 |
| Paper encoder only | 4.83 | 12.12 | 18.11 |
| Full retriever | 8.20 | 19.87 | 28.37 |
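
For reference, Recall@k in a paper-to-layout retrieval setting can be computed as sketched below, assuming each query paper has exactly one gold layout. The function names and the toy rankings are hypothetical. The chance-level helper assumes a uniformly random ranking; under that assumption, the Random row is numerically consistent with a gallery of roughly 2,000 candidates (1/2000 = 0.05%), though the gallery size is not stated in this summary.

```python
def recall_at_k(ranked_ids_per_query, gold_ids, k):
    """Percentage of queries whose gold item appears among the top-k results."""
    hits = sum(
        1
        for ranked, gold in zip(ranked_ids_per_query, gold_ids)
        if gold in ranked[:k]
    )
    return 100.0 * hits / len(gold_ids)


def chance_recall_at_k(k, gallery_size):
    """Expected Recall@k (%) of a uniformly random ranking with one gold item."""
    return 100.0 * k / gallery_size


# Toy rankings for four query papers (hypothetical layout ids).
rankings = [
    ["l1", "l7", "l3"],
    ["l4", "l2", "l9"],
    ["l5", "l6", "l8"],
    ["l0", "l3", "l4"],
]
gold = ["l1", "l9", "l2", "l3"]
print(recall_at_k(rankings, gold, k=1), recall_at_k(rankings, gold, k=3))
```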

Layout Generation Quality (FID / mIoU / Overlap)

| Configuration | FID ↓ | mIoU ↑ | Overlap ↓ |
| --- | --- | --- | --- |
| Without retrieval | baseline | baseline | baseline |
| + Retrieval augmentation | improved | improved | reduced |
| + Retrieval + constraints (semi-automatic) | best | best | lowest |
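
The box-level quantities behind IoU-style and overlap-style metrics can be sketched as below. Exact metric definitions vary between layout-generation papers, so this is only one simple reading under assumptions: boxes are `(x, y, w, h)` tuples in normalized coordinates, and "overlap" is taken as the mean pairwise IoU between distinct elements of one layout.

```python
from itertools import combinations


def intersection_area(a, b):
    """Area shared by two (x, y, w, h) boxes; 0.0 if they do not intersect."""
    width = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    height = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def iou(a, b):
    """Intersection over union of two boxes with positive area."""
    inter = intersection_area(a, b)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union


def mean_pairwise_overlap(boxes):
    """Average IoU over all pairs of distinct boxes in a single layout."""
    pairs = list(combinations(boxes, 2))
    if not pairs:
        return 0.0
    return sum(iou(a, b) for a, b in pairs) / len(pairs)


# Toy layout: two boxes overlap, the third is separate.
layout = [(0.0, 0.0, 0.5, 0.5), (0.25, 0.0, 0.5, 0.5), (0.0, 0.6, 1.0, 0.3)]
print(mean_pairwise_overlap(layout))
```

A lower value indicates that generated elements collide less, which matches the "Overlap ↓" direction in the table.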

Ablation Study

| Configuration | Retrieval Recall@3 (%) | Generation FID | Notes |
| --- | --- | --- | --- |
| Image encoding only (DiT) | 19.87 | - | Base retrieval performance |
| Layout annotation encoding only | lower | - | Image encoding outperforms structured annotations |
| Direct generation without retrieval | - | higher | Poor quality without reference layouts |
| Top-1 retrieval | - | moderate | Insufficient diversity with single template |
| Top-3 retrieval | - | lowest | Multiple templates provide better guidance |

Key Findings

  • The Spearman correlation between paper structure and poster layout is moderate (|ρ| ≈ 0.40–0.50), indicating that structural information is useful but insufficient to fully determine the layout.
  • Image-based encoding outperforms directly using layout annotations as input—images implicitly preserve spatial relationships.
  • In semi-automatic mode, the addition of constraints substantially improves consistency between generated and ground-truth layouts.
  • The mAP@0.50:0.95 between silver (automatically annotated) and gold (manually corrected) layouts is 0.53, indicating moderate agreement.

Highlights & Insights

  • The dataset construction methodology is instructive: automatic annotation (Azure Document Intelligence + Nougat OCR) combined with manual correction for validation/test sets balances scale and quality—a practical strategy under limited annotation resources.
  • The research problem of "paper structure → poster layout" is itself novel: prior work focuses on content summarization; this paper is the first to systematically study the structure-to-layout mapping and quantitatively analyze their correlations.
  • Retrieval-augmented strategy: using retrieved layouts from structurally similar papers as "reference designs" to guide generation is more controllable and diverse than generating from scratch.

Limitations & Future Work

  • Only layout bounding boxes are generated, not actual poster content (text, images), leaving a substantial gap to end-to-end poster generation.
  • The dataset is limited to computer science conferences (CVPR/ICLR/ICML/NeurIPS); poster styles in other disciplines may differ.
  • Retrieval Recall@1 is only 8.2%, suggesting that the paper-to-layout mapping remains weak and may require richer paper representations.
  • The subjective quality of generated layouts (e.g., readability, aesthetics) is not evaluated; only numerical metrics are used.

Comparison with Related Work

  • vs. PosterLayout [56]: Provides fine-grained layout annotations but only hundreds of pairs; SciPostGen is 30× larger in scale.
  • vs. rule-based layout: Methods such as [42] derive layouts from predefined rules based on paper structure, lacking flexibility and diversity.
  • vs. general layout generation: Advertisement/webpage layout generation methods cannot leverage paper structural information; this work introduces the paper as a conditioning signal.

Rating

  • Novelty: ⭐⭐⭐⭐ The research problem (paper → poster layout) is novel and the dataset is valuable, though the method itself is a standard combination of retrieval augmentation and an LLM.
  • Experimental Thoroughness: ⭐⭐⭐ Lacks user studies and subjective evaluation; quantitative metrics for retrieval and generation are not comprehensive enough.
  • Writing Quality: ⭐⭐⭐⭐ The dataset construction and analysis sections are clear; the overall structure is well-organized.
  • Value: ⭐⭐⭐⭐ The dataset is a valuable contribution to the community, and the framework lays a foundation for automated academic poster generation.