
HybriDLA: Hybrid Generation for Document Layout Analysis

  • Conference: AAAI 2026
  • arXiv: 2511.19919
  • Code: GitHub
  • Area: Document Analysis / Object Detection
  • Keywords: Document layout analysis, diffusion models, autoregressive generation, hybrid decoding, multi-scale feature fusion

TL;DR

HybriDLA is the first approach to unify diffusion-based bounding box refinement and autoregressive query expansion within a single decoding layer, simulating a human coarse-to-fine reading strategy for document layout analysis. It achieves 83.5% mAP on DocLayNet with a vision-only model, approaching multimodal systems.

Background & Motivation

Document layout analysis (DLA) is a fundamental task for document understanding and information extraction. Modern documents exhibit increasing structural complexity, with the number of layout elements per page ranging from as few as 2 to as many as 200. Traditional methods such as Faster R-CNN rely on a fixed number of region proposals, while DETR-based methods employ a fixed set of learnable queries. When the actual number of elements deviates significantly from the preset query count, fixed-query schemes either miss detections or introduce a large number of spurious "empty object" queries, hurting both efficiency and accuracy.

Existing diffusion-based detectors (e.g., DiffusionDet) introduce iterative refinement but still depend on a fixed-size pool of initial noisy boxes; autoregressive detectors (e.g., Pix2Seq) support variable-length sequences but incur linearly growing inference costs and lack spatial refinement capability.

HybriDLA's core motivation is to emulate the human document reading strategy: first scanning the entire page to grasp major regions, then progressively zooming into each region and dynamically adjusting the granularity of attention. Accordingly, the authors complementarily integrate diffusion-based refinement (responsible for iterative denoising of spatial coordinates) and autoregressive expansion (responsible for semantic awareness and dynamic query generation) within a unified decoder.

Method

Overall Architecture

HybriDLA adopts a two-stage hierarchical generation pipeline: a Feature Fusion Encoder (FFE) followed by a Hybrid Generative Decoder. The encoder processes multi-scale features extracted by the backbone to produce coarse layout priors; the decoder leverages these priors for autoregressive query expansion and diffusion-based box refinement, progressively generating precise layout elements and semantic labels in a coarse-to-fine manner across decoding layers.

Key Designs

  1. Feature Fusion Encoder (FFE)

  • Function: Fuses the backbone's multi-scale feature maps \(\{F_l\}_{l=1}^{L}\) into a unified spatially-aware representation \(G\).
  • Mechanism: Consists of two steps: local feature encoding and cross-scale fusion. Local encoding \(H_l = \phi(F_l)\) combines self-attention and convolution to capture long-range dependencies and local texture patterns within each scale; cross-scale fusion \(G = \Psi(\{H_l\}_{l=1}^{L})\) employs cross-scale attention and lateral convolutional layers for adaptive information exchange across scales, so fine-grained feature maps acquire global context and coarse feature maps acquire local detail.
  • Design Motivation: Layout elements in documents vary dramatically in scale (titles, footnotes, and figures differ greatly in size), so a single-scale representation cannot capture global structure and local detail simultaneously.
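The cross-scale fusion idea can be sketched in a few lines of NumPy. This is an illustrative simplification, not the paper's implementation: an FPN-style top-down sum stands in for the cross-scale attention and lateral convolutions, and all shapes and function names are hypothetical.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse_multiscale(features):
    """Top-down fusion: each finer map receives context from the
    coarser map above it, so every level sees global structure.
    `features` is ordered fine -> coarse."""
    fused = [features[-1]]  # start from the coarsest level
    for f in reversed(features[:-1]):
        coarse = upsample2x(fused[0])
        fused.insert(0, f + coarse)  # lateral sum injects global context
    return fused

# Three scales of an 8-channel feature pyramid.
feats = [np.random.rand(8, 32, 32),
         np.random.rand(8, 16, 16),
         np.random.rand(8, 8, 8)]
out = fuse_multiscale(feats)
print([f.shape for f in out])  # per-level shapes are preserved
```

The design choice being illustrated is that fusion leaves each scale's resolution intact while mixing information across scales, rather than collapsing everything to a single resolution.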

  2. Autoregressive Query Expansion (AQE)

  • Function: Models the query generation process autoregressively, dynamically determining how many queries to generate and their semantic content.
  • Mechanism: Given image features \(X\), the model defines a joint distribution over a variable-length query sequence \(Q = (q_1, q_2, \dots, q_N)\), factorized as \(P(Q \mid X) = \prod_{t=1}^{N} P(q_t \mid X, q_{1:t-1}) \cdot P(\text{EOS} \mid X, q_{1:N})\). At each step, the next query is conditioned on the existing query context, and expansion terminates adaptively via a learned EOS criterion.
  • Design Motivation: The number of elements varies enormously across documents, which inherently limits fixed-query methods such as DETR. The autoregressive formulation lets the model condition queries on prior context and adjust the query count to document complexity.
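The control flow implied by this factorization is a generate-until-EOS loop. A minimal sketch follows, with `next_query` as a hypothetical stand-in for the decoder step \(P(q_t \mid X, q_{1:t-1})\); the dummy scoring inside it is invented purely to make the loop runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_query(image_feat, context):
    # Stand-in for one decoder step: returns a query embedding and
    # an EOS probability conditioned on the queries produced so far.
    q = image_feat.mean(axis=0) + 0.1 * len(context)  # dummy conditioning
    p_eos = 0.2 * len(context)                        # dummy stopping score
    return q, p_eos

def expand_queries(image_feat, max_queries=50, eos_threshold=0.5):
    """Autoregressive query expansion: grow the query set one step at a
    time until the learned EOS criterion fires, instead of decoding a
    fixed number of queries as in DETR."""
    queries = []
    for _ in range(max_queries):
        q, p_eos = next_query(image_feat, queries)
        if p_eos > eos_threshold:
            break
        queries.append(q)
    return queries

X = rng.normal(size=(100, 16))  # flattened image features
qs = expand_queries(X)
print(len(qs))
```

Note the trade-off the paper discusses: each step conditions on all previous queries, so inference cost grows with the number of elements on the page.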

  3. Diffusion-based Refinement (DR)

  • Function: Models layout prediction as an implicit denoising operation, applying a residual correction to the current predictions at each decoding layer.
  • Mechanism: The update rule is \(\hat{y}^{(t+1)} = \hat{y}^{(t)} + \Delta^{(t)}\), where \(\Delta^{(t)}\) denotes the predicted residual at step \(t\). Within each decoding layer, self-attention lets queries share contextual information, cross-attention integrates visual features, and the feed-forward network applies the residual correction.
  • Design Motivation: Directly predicting precise coordinates is difficult; iterative denoising progressively eliminates spatial errors. During training, a subset of queries is initialized with perturbed ground-truth boxes (denoising queries), and layer-wise intermediate supervision accelerates convergence.
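The update rule \(\hat{y}^{(t+1)} = \hat{y}^{(t)} + \Delta^{(t)}\) can be made concrete with a toy residual predictor. This is a sketch of the refinement loop only: `predict_residual` is a hypothetical oracle that moves each box halfway toward the ground truth, standing in for the learned feed-forward head.

```python
import numpy as np

def predict_residual(boxes, target):
    # Toy stand-in for a decoding layer's FFN head: moves each box a
    # fraction of the way toward the ground truth.
    return 0.5 * (target - boxes)

def refine(boxes, target, steps=6):
    """Layer-wise residual refinement: y^(t+1) = y^(t) + delta^(t)."""
    for _ in range(steps):
        boxes = boxes + predict_residual(boxes, target)
    return boxes

noisy = np.array([[0.1, 0.1, 0.9, 0.9]])  # perturbed initial box (cx, cy, w, h)
gt    = np.array([[0.4, 0.5, 0.3, 0.2]])  # ground-truth box
out = refine(noisy, gt)
print(np.abs(out - gt).max())  # residual error shrinks by half per step
```

This mirrors why layer-wise supervision helps: each layer only needs to predict a small correction, and errors shrink geometrically across decoding layers.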

Loss & Training

  • Set prediction loss with Hungarian matching (consistent with DETR).
  • Denoising training: a subset of queries is initialized from perturbed ground-truth boxes, forcing the network to recover correct layouts from degraded inputs.
  • Layer-wise intermediate supervision: each decoding layer has auxiliary loss heads to ensure intermediate predictions remain close to ground truth.
  • All models are trained with a unified batch size of 40 for fair comparison.
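The Hungarian matching step of the set prediction loss can be sketched with SciPy. This is a minimal illustration using an L1 box-distance cost only; the actual DETR-style cost also includes classification and GIoU terms.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes, gt_boxes):
    """Pair each prediction with at most one ground-truth box by
    minimizing total L1 box distance (bipartite matching)."""
    # cost[i, j] = L1 distance between prediction i and ground truth j
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx, gt_idx)), cost[pred_idx, gt_idx].sum()

preds = np.array([[0.5, 0.5, 0.2, 0.2],
                  [0.1, 0.1, 0.3, 0.3]])
gts   = np.array([[0.1, 0.1, 0.3, 0.3],
                  [0.5, 0.5, 0.2, 0.2]])
pairs, total = hungarian_match(preds, gts)
print(pairs)  # each prediction matched to its nearest distinct ground truth
```

Matched pairs then receive the box regression and classification losses; unmatched queries are supervised toward the "no object" class.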

Key Experimental Results

Main Results (DocLayNet)

| Method Type | Backbone | Detector | mAP (%) |
|---|---|---|---|
| Traditional region-based | ResNet-101 | Mask R-CNN | 73.5 |
| DETR-based | InternImage | RoDLA | 80.5 |
| DETR-based (multimodal) | | DLAFormer | 83.8 |
| Diffusion-based | Swin-L | DiffusionDet | 76.3 |
| Autoregressive | ViT-L | Pix2Seq | 72.5 |
| Hybrid (Ours) | InternImage | HybriDLA | 83.5 |
| Hybrid (Ours) | Swin-L | HybriDLA | 80.4 |
| Hybrid (Ours) | ResNet-50 | HybriDLA | 74.4 |

HybriDLA achieves 83.5% mAP with vision-only input, trailing the multimodal DLAFormer by only 0.3 points. Compared with methods using the same backbone, it achieves an average gain of roughly 3 mAP points.

Main Results (M6Doc, 74 categories)

| Method Type | Backbone | Detector | mAP (%) |
|---|---|---|---|
| Traditional region-based | DiT | Cascade R-CNN | 70.2 |
| DETR-based | InternImage | RoDLA | 70.0 |
| Diffusion-based | Swin-L | DiffusionDet | 62.7 |
| Hybrid (Ours) | InternImage | HybriDLA | 71.4 |
| Hybrid (Ours) | ViT-L | HybriDLA | 68.6 |

Ablation Study

| Configuration | mAP (%) | Note |
|---|---|---|
| DETR baseline (ResNet-50) | 74.2 | No AQE |
| + AQE | 74.4 | Marginal gain from autoregressive expansion |
| Deformable DETR + AQE | 76.3 | Larger gain with stronger baseline |
| DINO + AQE | 76.8 | +1.5% |
| DE + DR + AQE (Swin-L) | 78.1 | Standard encoder |
| FFE + AQE (Swin-L) | 79.1 | FFE outperforms DE by 1.0% |
| FFE + DR + AQE (Swin-L) | 80.4 | DR adds a further 1.3% |
| FFE + DR + AQE (InternImage) | 83.5 | Best configuration |

Key Findings

  • The gain from AQE is more pronounced with stronger baseline detectors, indicating complementarity.
  • FFE yields clear benefits with large, feature-rich backbones (Swin-L: +2.3% vs. DE) but may degrade performance on smaller models.
  • DR provides gains of 0.8–1.3 mAP points on nearly all backbones, with no measurable change on ResNet-50.
  • Query count analysis reveals an optimal query budget per model: smaller models saturate at 30 expanded queries, while larger models require up to 300.

Highlights & Insights

  • HybriDLA is the first to unify diffusion-based and autoregressive generation paradigms for document layout analysis, representing a conceptually novel contribution.
  • The cognitively-motivated coarse-to-fine strategy has clear methodological correspondences: AQE corresponds to "scanning the page to discover new regions," and DR corresponds to "focusing on details for precise localization."
  • The vision-only model nearly matches the multimodal system (83.5% vs. 83.8%), demonstrating that significant headroom remains in exploiting purely visual features.
  • The architecture is backbone-agnostic and integrates seamlessly with ResNet, ViT, Swin, InternImage, and other backbone families.

Limitations & Future Work

  • The method relies solely on visual input and does not leverage multimodal signals such as OCR text and coordinate metadata, potentially limiting semantic disambiguation.
  • The hybrid generation mechanism incurs substantial inference cost, constraining real-time processing and large-scale deployment.
  • Future work may incorporate multimodal features or document metadata to enhance semantic understanding.
  • Model distillation or architectural optimization could reduce inference overhead.
  • The DETR family (DINO, Deformable DETR) provides a solid foundation for set prediction, and HybriDLA demonstrates the value of incorporating generative components on top of these baselines.
  • The iterative denoising concept from DiffusionDet is elegantly combined with autoregressive expansion, highlighting the complementarity of different generative paradigms in detection tasks.
  • The hybrid generation strategy proposed in this work is generalizable to general object detection in scenes with highly variable numbers of elements.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐