
HybriDLA: Hybrid Generation for Document Layout Analysis

  • Conference: AAAI 2026
  • arXiv: 2511.19919
  • Code: GitHub
  • Area: Document Analysis / Object Detection
  • Keywords: Document layout analysis, diffusion models, autoregressive generation, hybrid decoding, multi-scale feature fusion

TL;DR

HybriDLA is the first approach to unify diffusion-based bounding box refinement and autoregressive query expansion within a single decoding layer, simulating a human coarse-to-fine reading strategy for document layout analysis. It achieves 83.5% mAP on DocLayNet with a vision-only model, approaching multimodal systems.

Background & Motivation

Document layout analysis (DLA) is a fundamental task for document understanding and information extraction. Modern documents exhibit increasing structural complexity, with the number of layout elements per page ranging from as few as 2 to as many as 200. Traditional methods such as Faster R-CNN rely on a fixed number of region proposals, while DETR-based methods employ a fixed set of learnable queries. When the actual number of elements deviates significantly from the preset query count, fixed-query schemes either miss detections or introduce a large number of spurious "empty object" queries, hurting both efficiency and accuracy.

Existing diffusion-based detectors (e.g., DiffusionDet) introduce iterative refinement but still depend on a fixed-size pool of initial noisy boxes; autoregressive detectors (e.g., Pix2Seq) support variable-length sequences but incur linearly growing inference costs and lack spatial refinement capability.

HybriDLA's core motivation is to emulate the human document reading strategy: first scanning the entire page to grasp major regions, then progressively zooming into each region and dynamically adjusting the granularity of attention. Accordingly, the authors complementarily integrate diffusion-based refinement (responsible for iterative denoising of spatial coordinates) and autoregressive expansion (responsible for semantic awareness and dynamic query generation) within a unified decoder.

Method

Overall Architecture

HybriDLA adopts a two-stage hierarchical generation pipeline: a Feature Fusion Encoder (FFE) followed by a Hybrid Generative Decoder. The encoder processes multi-scale features extracted by the backbone to produce coarse layout priors; the decoder leverages these priors for autoregressive query expansion and diffusion-based box refinement, progressively generating precise layout elements and semantic labels in a coarse-to-fine manner across decoding layers.

Key Designs

  1. Feature Fusion Encoder (FFE)

  • Function: Fuses the backbone's multi-scale feature maps \(\{F_l\}_{l=1}^{L}\) into a unified spatially-aware representation \(G\).
  • Mechanism: Consists of two steps: local feature encoding and cross-scale fusion. Local encoding \(H_l = \phi(F_l)\) combines self-attention and convolution to capture long-range dependencies and local texture patterns within each scale; cross-scale fusion \(G = \Psi(\{H_l\}_{l=1}^{L})\) employs cross-scale attention and lateral convolutional layers for adaptive information exchange across scales, so fine-grained feature maps acquire global context and coarse feature maps acquire local detail.
  • Design Motivation: Layout elements in documents vary dramatically in scale (titles, footnotes, and figures differ greatly in size), so a single-scale representation cannot capture global structure and local detail simultaneously.
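The cross-scale fusion idea can be sketched in a few lines of NumPy. This is an illustrative simplification, not the paper's implementation: an FPN-style top-down sum stands in for the cross-scale attention and lateral convolutions, and all shapes and function names are hypothetical.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse_multiscale(features):
    """Top-down fusion: each finer map receives context from the
    coarser map above it, so every level sees global structure.
    `features` is ordered fine -> coarse."""
    fused = [features[-1]]  # start from the coarsest level
    for f in reversed(features[:-1]):
        coarse = upsample2x(fused[0])
        fused.insert(0, f + coarse)  # lateral sum injects global context
    return fused

# Three scales of an 8-channel feature pyramid.
feats = [np.random.rand(8, 32, 32),
         np.random.rand(8, 16, 16),
         np.random.rand(8, 8, 8)]
out = fuse_multiscale(feats)
print([f.shape for f in out])  # per-level shapes are preserved
```

The design choice being illustrated is that fusion leaves each scale's resolution intact while mixing information across scales, rather than collapsing everything to a single resolution.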

  2. Autoregressive Query Expansion (AQE)

  • Function: Models the query generation process autoregressively, dynamically determining how many queries to generate and their semantic content.
  • Mechanism: Given image features \(X\), the model defines a joint distribution over a variable-length query sequence \(Q = (q_1, q_2, \dots, q_N)\), factorized as \(P(Q \mid X) = \prod_{t=1}^{N} P(q_t \mid X, q_{1:t-1}) \cdot P(\text{EOS} \mid X, q_{1:N})\). At each step, the next query is conditioned on the existing query context, and expansion terminates adaptively via a learned EOS criterion.
  • Design Motivation: The number of elements varies enormously across documents, which inherently limits fixed-query methods such as DETR. The autoregressive formulation lets the model condition queries on prior context and adjust the query count to document complexity.
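The control flow implied by this factorization is a generate-until-EOS loop. A minimal sketch follows, with `next_query` as a hypothetical stand-in for the decoder step \(P(q_t \mid X, q_{1:t-1})\); the dummy scoring inside it is invented purely to make the loop runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_query(image_feat, context):
    # Stand-in for one decoder step: returns a query embedding and
    # an EOS probability conditioned on the queries produced so far.
    q = image_feat.mean(axis=0) + 0.1 * len(context)  # dummy conditioning
    p_eos = 0.2 * len(context)                        # dummy stopping score
    return q, p_eos

def expand_queries(image_feat, max_queries=50, eos_threshold=0.5):
    """Autoregressive query expansion: grow the query set one step at a
    time until the learned EOS criterion fires, instead of decoding a
    fixed number of queries as in DETR."""
    queries = []
    for _ in range(max_queries):
        q, p_eos = next_query(image_feat, queries)
        if p_eos > eos_threshold:
            break
        queries.append(q)
    return queries

X = rng.normal(size=(100, 16))  # flattened image features
qs = expand_queries(X)
print(len(qs))
```

Note the trade-off the paper discusses: each step conditions on all previous queries, so inference cost grows with the number of elements on the page.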

  3. Diffusion-based Refinement (DR)

  • Function: Models layout prediction as an implicit denoising operation, applying a residual correction to the current predictions at each decoding layer.
  • Mechanism: The update rule is \(\hat{y}^{(t+1)} = \hat{y}^{(t)} + \Delta^{(t)}\), where \(\Delta^{(t)}\) denotes the predicted residual at step \(t\). Within each decoding layer, self-attention lets queries share contextual information, cross-attention integrates visual features, and the feed-forward network applies the residual correction.
  • Design Motivation: Directly predicting precise coordinates is difficult; iterative denoising progressively eliminates spatial errors. During training, a subset of queries is initialized with perturbed ground-truth boxes (denoising queries), and layer-wise intermediate supervision accelerates convergence.
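The update rule \(\hat{y}^{(t+1)} = \hat{y}^{(t)} + \Delta^{(t)}\) can be made concrete with a toy residual predictor. This is a sketch of the refinement loop only: `predict_residual` is a hypothetical oracle that moves each box halfway toward the ground truth, standing in for the learned feed-forward head.

```python
import numpy as np

def predict_residual(boxes, target):
    # Toy stand-in for a decoding layer's FFN head: moves each box a
    # fraction of the way toward the ground truth.
    return 0.5 * (target - boxes)

def refine(boxes, target, steps=6):
    """Layer-wise residual refinement: y^(t+1) = y^(t) + delta^(t)."""
    for _ in range(steps):
        boxes = boxes + predict_residual(boxes, target)
    return boxes

noisy = np.array([[0.1, 0.1, 0.9, 0.9]])  # perturbed initial box (cx, cy, w, h)
gt    = np.array([[0.4, 0.5, 0.3, 0.2]])  # ground-truth box
out = refine(noisy, gt)
print(np.abs(out - gt).max())  # residual error shrinks by half per step
```

This mirrors why layer-wise supervision helps: each layer only needs to predict a small correction, and errors shrink geometrically across decoding layers.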

Loss & Training

  • Set prediction loss with Hungarian matching (consistent with DETR).
  • Denoising training: a subset of queries is initialized from perturbed ground-truth boxes, forcing the network to recover correct layouts from degraded inputs.
  • Layer-wise intermediate supervision: each decoding layer has auxiliary loss heads to ensure intermediate predictions remain close to ground truth.
  • All models are trained with a unified batch size of 40 for fair comparison.
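The Hungarian matching step of the set prediction loss can be sketched with SciPy. This is a minimal illustration using an L1 box-distance cost only; the actual DETR-style cost also includes classification and GIoU terms.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes, gt_boxes):
    """Pair each prediction with at most one ground-truth box by
    minimizing total L1 box distance (bipartite matching)."""
    # cost[i, j] = L1 distance between prediction i and ground truth j
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx, gt_idx)), cost[pred_idx, gt_idx].sum()

preds = np.array([[0.5, 0.5, 0.2, 0.2],
                  [0.1, 0.1, 0.3, 0.3]])
gts   = np.array([[0.1, 0.1, 0.3, 0.3],
                  [0.5, 0.5, 0.2, 0.2]])
pairs, total = hungarian_match(preds, gts)
print(pairs)  # each prediction matched to its nearest distinct ground truth
```

Matched pairs then receive the box regression and classification losses; unmatched queries are supervised toward the "no object" class.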

Key Experimental Results

Main Results (DocLayNet)

| Method Type | Backbone | Detector | mAP (%) |
|---|---|---|---|
| Traditional region-based | ResNet-101 | Mask R-CNN | 73.5 |
| DETR-based | InternImage | RoDLA | 80.5 |
| DETR-based (multimodal) | | DLAFormer | 83.8 |
| Diffusion-based | Swin-L | DiffusionDet | 76.3 |
| Autoregressive | ViT-L | Pix2Seq | 72.5 |
| Hybrid (Ours) | InternImage | HybriDLA | 83.5 |
| Hybrid (Ours) | Swin-L | HybriDLA | 80.4 |
| Hybrid (Ours) | ResNet-50 | HybriDLA | 74.4 |

HybriDLA achieves 83.5% mAP with vision-only input, trailing the multimodal DLAFormer by only 0.3 points. Compared with methods using the same backbone, it achieves an average gain of roughly 3 mAP points.

Main Results (M6Doc, 74 categories)

| Method Type | Backbone | Detector | mAP (%) |
|---|---|---|---|
| Traditional region-based | DiT | Cascade R-CNN | 70.2 |
| DETR-based | InternImage | RoDLA | 70.0 |
| Diffusion-based | Swin-L | DiffusionDet | 62.7 |
| Hybrid (Ours) | InternImage | HybriDLA | 71.4 |
| Hybrid (Ours) | ViT-L | HybriDLA | 68.6 |

Ablation Study

| Configuration | mAP (%) | Note |
|---|---|---|
| DETR baseline (ResNet-50) | 74.2 | No AQE |
| + AQE | 74.4 | Marginal gain from autoregressive expansion |
| Deformable DETR + AQE | 76.3 | Larger gain with stronger baseline |
| DINO + AQE | 76.8 | +1.5% |
| DE + DR + AQE (Swin-L) | 78.1 | Standard encoder |
| FFE + AQE (Swin-L) | 79.1 | FFE outperforms DE by 1.0% |
| FFE + DR + AQE (Swin-L) | 80.4 | DR adds a further 1.3% |
| FFE + DR + AQE (InternImage) | 83.5 | Best configuration |

Key Findings

  • The gain from AQE is more pronounced with stronger baseline detectors, indicating complementarity.
  • FFE yields clear benefits with large, feature-rich backbones (Swin-L: +2.3% vs. DE) but may degrade performance on smaller models.
  • DR provides gains of 0.8–1.3 mAP points on nearly all backbones, with no measurable change on ResNet-50.
  • Query count analysis reveals an optimal query budget per model: smaller models saturate at 30 expanded queries, while larger models require up to 300.

Highlights & Insights

  • HybriDLA is the first to unify diffusion-based and autoregressive generation paradigms for document layout analysis, representing a conceptually novel contribution.
  • The cognitively-motivated coarse-to-fine strategy has clear methodological correspondences: AQE corresponds to "scanning the page to discover new regions," and DR corresponds to "focusing on details for precise localization."
  • The vision-only model nearly matches the multimodal system (83.5% vs. 83.8%), demonstrating that significant headroom remains in exploiting purely visual features.
  • The architecture is backbone-agnostic and integrates seamlessly with ResNet, ViT, Swin, InternImage, and other backbone families.

Limitations & Future Work

  • The method relies solely on visual input and does not leverage multimodal signals such as OCR text and coordinate metadata, potentially limiting semantic disambiguation.
  • The hybrid generation mechanism incurs substantial inference cost, constraining real-time processing and large-scale deployment.
  • Future work may incorporate multimodal features or document metadata to enhance semantic understanding.
  • Model distillation or architectural optimization could reduce inference overhead.
  • The DETR family (DINO, Deformable DETR) provides a solid foundation for set prediction, and HybriDLA demonstrates the value of incorporating generative components on top of these baselines.
  • The iterative denoising concept from DiffusionDet is elegantly combined with autoregressive expansion, highlighting the complementarity of different generative paradigms in detection tasks.
  • The hybrid generation strategy proposed in this work is generalizable to general object detection in scenes with highly variable numbers of elements.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐