LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis

Conference: CVPR 2026
Paper: CVF Open Access
Code: To be confirmed
Area: Document Intelligence / Semi-supervised Object Detection
Keywords: Document Layout Analysis, Semi-supervised Detection, LLM Structural Prior, Inverse Variance Fusion, Pseudo-labeling

TL;DR

This paper integrates text-pretrained LLMs as "structural prior generators" into the pseudo-label refinement stage of semi-supervised layout detection. By using OCR+LLM to infer document hierarchical regions and performing inverse variance probabilistic fusion (including learnable instance-adaptive gating) with teacher detector outputs, the method achieves 88.2 AP (lightweight backbone) and 89.7 AP (LayoutLMv3) on PubLayNet using only 5% labels, with the most significant gains observed in rare layout elements such as titles and headers.

Background & Motivation

Background: Document layout analysis serves as the foundation for digital libraries, form processing, and document QA. Modern Transformer-based detectors offer high precision but rely heavily on large-scale annotations; semi-supervised learning (teacher-student, pseudo-labeling, consistency regularization) is the mainstream approach to reducing annotation costs.

Limitations of Prior Work: Semi-supervised detectors often inherit systematic biases from teacher models, struggling specifically with rare layout elements (caption, footer) and fine-grained distinctions (caption vs. footer, header vs. title)—categories that are inherently sparse and visually similar. Perception-only pseudo-labels lack "semantic structural" information.

Key Challenge: Humans interpret document structure based on textual semantics—"Figure 3:" suggests a caption, bold text at the page top suggests a header, and tabular alignment suggests a data table. Existing semi-supervised detection relies solely on visual cues, wasting this free linguistic prior. Conversely, directly replacing detectors with LLM/VLMs is ineffective (GPT-4V zero-shot only achieves 74.3 AP in experiments) due to their weak spatial localization capabilities.

Goal: To inject the structural reasoning capabilities of LLMs into pseudo-label refinement without discarding mature detection architectures or requiring large-scale annotations. This is decomposed into two sub-problems: (i) how to fuse "LLM-provided structural regions" with "teacher-provided visual boxes" in a principled manner rather than by simple concatenation; (ii) how to make the fusion weights adaptive and backed by theoretical guarantees.

Key Insight: Positioning the LLM as a "structural prior generator" rather than a detector replacement—providing it with OCR-extracted text blocks and their coordinates allows it to infer hierarchies, distinguish semantic regions, and even correct OCR errors. This knowledge is complementary to visual pattern recognition.

Core Idea: Combining LLM structural priors with teacher visual predictions via "inverse variance probabilistic fusion + learnable instance gating" to generate refined pseudo-labels. Higher weights are assigned to more certain sources, with the cross-modal fusion efficiency explained through data-dependent PAC bounds.

Method

Overall Architecture

The pipeline is built upon a lightweight DETR-style detector (SwiftFormer-Tiny backbone, 3-layer encoder/decoder, 100 object queries). For unlabeled documents: one path uses Tesseract OCR to extract text blocks $B=\{(b^{ocr}_j,t_j)\}$, which are fed to the LLM to return structural regions with boxes, categories, and confidence $r_k=(b^{llm}_k,c_k,s_k)$; the other path generates visual predictions $T$ via the teacher detector. The two paths are aligned through IoU matching and combined via probabilistic fusion to generate refined pseudo-labels. The student detector is trained on these labels while aligning visual queries with OCR text embeddings through cross-modal consistency; the student updates the teacher via EMA.

graph TD
    A["Unlabeled Documents"] --> B["LLM Structural Prior Fusion<br/>OCR text extraction -> LLM region inference + Teacher visual prediction -> IoU Alignment"]
    B --> C["Adaptive Probabilistic Fusion<br/>Inverse Variance Weighting + Learnable Instance Gating (PAC Bound)"]
    C --> D["Refined Pseudo-labels"]
    D --> E["Student Detector Training<br/>+ Cross-modal Consistency Constraint (CLIP Text <-> Visual Query)"]
    E -->|EMA Momentum 0.999| F["Update Teacher -> Regenerate Pseudo-labels"]
    F -.Loop.-> B

Key Designs

1. LLM Structural Prior Fusion: Injecting Linguistic Hierarchy into Visual Pseudo-labels

This design addresses the lack of semantic structure in purely visual pseudo-labels. OCR decomposes each unlabeled document into text blocks, and the LLM is prompted to identify structural regions, returning $r_k=(b^{llm}_k,c_k,s_k)$. LLM regions $L$ and teacher predictions $T$ are aligned via IoU matching: when $\mathrm{IoU}(b^t_i,b^{llm}_k)\ge\tau$ and the categories are compatible, a fused prediction is generated with box $b_f=\alpha b^t_i+(1-\alpha)b^{llm}_k$ and confidence $p_f=\sigma(w_t\cdot\mathrm{logit}(p^t_i)+w_l\cdot\mathrm{logit}(s_k))$, using the fixed settings $\alpha=0.6$, $w_t=0.7$, $w_l=0.3$. Unmatched high-confidence LLM regions are added as soft pseudo-labels (label smoothing $\epsilon=0.2$) to help rare classes. LLM regions are cached offline to amortize inference cost.
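
The matching-and-fusion rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names (`iou`, `fuse_fixed`), the dictionary layout, the `(x1, y1, x2, y2)` box format, the placeholder threshold `TAU = 0.5` (the text does not state $\tau$), and the simplification of "compatible categories" to label equality are all assumptions.

```python
import math

# Fixed settings quoted in the text: alpha, w_t, w_l.
ALPHA, W_T, W_L = 0.6, 0.7, 0.3
# Assumption: the text does not give tau; 0.5 is only a placeholder.
TAU = 0.5


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return math.log(p / (1.0 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse_fixed(teacher, llm, tau=TAU):
    """Fuse one teacher prediction with one LLM region, or return None.

    Both inputs are dicts with keys "box", "label", "score".
    Assumption: category compatibility is reduced to label equality.
    """
    if teacher["label"] != llm["label"]:
        return None
    if iou(teacher["box"], llm["box"]) < tau:
        return None
    # b_f = alpha * b_t + (1 - alpha) * b_llm, applied per coordinate.
    box = tuple(ALPHA * t + (1.0 - ALPHA) * l
                for t, l in zip(teacher["box"], llm["box"]))
    # p_f = sigmoid(w_t * logit(p_t) + w_l * logit(s_k)).
    score = sigmoid(W_T * logit(teacher["score"]) + W_L * logit(llm["score"]))
    return {"box": box, "label": teacher["label"], "score": score}


teacher_pred = {"box": (0.0, 0.0, 10.0, 10.0), "label": "caption", "score": 0.8}
llm_region = {"box": (0.0, 0.0, 10.0, 8.0), "label": "caption", "score": 0.6}
fused = fuse_fixed(teacher_pred, llm_region)
```

Because the confidence is combined in logit space with weights that sum to 1, the fused score always lies between the two input scores, closer to the teacher's.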

2. Adaptive Probabilistic Fusion: Using Uncertainty to Determine Trust with PAC Guarantees

Fixed weights are suboptimal heuristics. This design starts from uncertainty quantification: teacher uncertainty is estimated via prediction variance $\sigma^2_t=\mathrm{Var}(p^{t,1}_i,\dots)$, while LLM uncertainty depends on text evidence quality $\sigma^2_l=1/(Q_{text}(t_k)\cdot Q_{spatial}(b^{llm}_k))$, where $Q_{text}$ measures text clarity and $Q_{spatial}$ measures spatial consistency. The minimum-variance unbiased estimator provides inverse variance weighted localization:

\[b_f=\frac{b^t_i/\sigma^2_t+b^{llm}_k/\sigma^2_l}{1/\sigma^2_t+1/\sigma^2_l}\]
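
As a brief sketch of where these weights come from, assume the two box estimates are unbiased with independent errors of variance $\sigma^2_t$ and $\sigma^2_l$ (the standard setting for inverse variance weighting; the mixing weight $w$ is notation introduced here for illustration). Minimizing the variance of a convex combination gives:

\[
\begin{aligned}
b_f(w) &= w\,b^t_i + (1-w)\,b^{llm}_k, \\
\mathrm{Var}[b_f(w)] &= w^2\sigma^2_t + (1-w)^2\sigma^2_l, \\
\frac{d}{dw}\mathrm{Var}[b_f(w)] = 0 \;&\Rightarrow\; w^\star = \frac{\sigma^2_l}{\sigma^2_t+\sigma^2_l} = \frac{1/\sigma^2_t}{1/\sigma^2_t+1/\sigma^2_l}, \\
\mathrm{Var}[b_f(w^\star)] &= \frac{1}{1/\sigma^2_t+1/\sigma^2_l} \le \min(\sigma^2_t,\sigma^2_l).
\end{aligned}
\]

Substituting $w^\star$ into $b_f(w)$ recovers the fusion formula, and the last line shows that, under these assumptions, the fused estimate is never noisier than the better of the two sources.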

Weights are assigned by precision $1/\sigma^2$; more certain sources receive higher weights. Confidence is calculated as the geometric mean in logit space after temperature scaling. Since real predictions violate Gaussian assumptions, a learnable gating MLP is added to predict instance-level weights $\alpha_{adapt}$ from $[h^t_i;h^l_k;\mathrm{IoU};p^t_i;s_k;Q_{text};Q_{spatial}]$. This adds only 64K parameters (0.24% overhead) yet yields $+0.9$ AP. Theoretically, Theorem 1 establishes the variance optimality of inverse variance fusion, while Theorem 2 provides a data-dependent generalization bound, defining a complementarity dimension $k=\dim(\xi)\cdot \log(1+LB_\theta\sqrt{n})$. With $\dim(\xi)=3$ and $n=26\mathrm{K}$, $k\approx 22$, much smaller than $d=64\mathrm{K}$, predicting a convergence rate of $O(\sqrt{k/n})$.
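
A minimal numeric sketch of the inverse variance step follows. It covers only the closed-form weighting; the learnable gate and the temperature-scaled confidence are not modeled. The function names, the use of population variance over repeated teacher scores, and the small `eps` stabilizer are assumptions made for illustration.

```python
def llm_variance(q_text, q_spatial):
    """sigma_l^2 = 1 / (Q_text * Q_spatial), as defined in the text."""
    return 1.0 / (q_text * q_spatial)


def score_variance(scores):
    """Population variance of repeated teacher scores.

    Assumption: the text only writes Var(p^{t,1}_i, ...), so the exact
    estimator (and how the repeated predictions are produced) is unknown.
    """
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)


def inverse_variance_fuse(box_t, var_t, box_l, var_l, eps=1e-8):
    """Precision-weighted average of two boxes.

    Returns (fused_box, teacher_weight). eps guards against zero variance.
    """
    prec_t = 1.0 / (var_t + eps)
    prec_l = 1.0 / (var_l + eps)
    w_t = prec_t / (prec_t + prec_l)
    fused = tuple(w_t * t + (1.0 - w_t) * l for t, l in zip(box_t, box_l))
    return fused, w_t


# The teacher is three times more precise than the LLM here,
# so it receives three quarters of the weight.
fused_box, teacher_weight = inverse_variance_fuse(
    (0.0, 0.0, 10.0, 10.0), 0.01,
    (0.0, 0.0, 10.0, 8.0), 0.03,
)
```

Unlike the fixed $\alpha=0.6$ blend, the teacher weight here changes per instance: a blurry OCR region (low $Q_{text}$) inflates $\sigma^2_l$ and pushes the fused box toward the teacher.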

3. Cross-modal Consistency: Stabilizing Training with Textual Semantics

Noisy pseudo-labels can bias the student. For each predicted box $\hat{b}_i$, OCR blocks with IoU > 0.5 are aggregated to obtain text $t_i$, encoded into $f_t(t_i)$ using a frozen CLIP text tower. Simultaneously, the decoder query $q_i$ is mapped to visual features $f_v(q_i)$ via a projection head. A consistency loss encourages alignment: $L_{cons}=\frac{1}{N}\sum_i \mathbb{1}_{\{t_i\neq\varnothing\}}\big(1-\frac{f_v(q_i)\cdot f_t(t_i)}{\lVert f_v(q_i)\rVert\lVert f_t(t_i)\rVert}\big)$. Freezing the text encoder during training prevents overfitting to OCR errors and ensures the detector learns representations robust to pseudo-label noise.
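
The consistency loss can be written out directly from the formula. The sketch below uses plain Python lists in place of CLIP text embeddings and projected query features, and represents an empty text match ($t_i=\varnothing$) as `None`; both choices, and the function names, are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consistency_loss(visual_feats, text_feats):
    """L_cons = (1/N) * sum_i 1[t_i is not empty] * (1 - cos(f_v(q_i), f_t(t_i))).

    text_feats[i] is None when no OCR block overlaps predicted box i.
    """
    n = len(visual_feats)
    if n == 0:
        return 0.0
    total = 0.0
    for f_v, f_t in zip(visual_feats, text_feats):
        if f_t is None:  # indicator is zero: no text evidence for this box
            continue
        total += 1.0 - cosine(f_v, f_t)
    return total / n


loss = consistency_loss(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # projected queries f_v(q_i)
    [[1.0, 0.0], [1.0, 0.0], None],        # text embeddings f_t(t_i)
)
```

Note that the sum is normalized by $N$, the number of predictions, not by the number of boxes that have text, so boxes without OCR evidence dilute the loss instead of being ignored entirely.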

Loss & Training

The total objective is $L=L_{sup}(D_{labeled})+\lambda_{pseudo}L_{pseudo}(D_{unlabeled})+\lambda_{cons}L_{cons}(D_{unlabeled})$, where $L_{sup}$ is the standard DETR loss (focal classification + L1 + GIoU, $\lambda_{box}=5.0,\lambda_{giou}=2.0$), $\lambda_{pseudo}=1.0$, and $\lambda_{cons}=0.2$. A curriculum training strategy is employed: epochs 1–2 use high-confidence teacher pseudo-labels ($p^t_i\ge 0.7$); epochs 3–5 introduce teacher-LLM fused predictions; from epoch 6, LLM-only soft pseudo-labels are added for rare classes. The teacher is updated via EMA (0.999 momentum), and pseudo-labels are regenerated every 2 epochs.
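
The schedule described above can be summarized as a small sketch. The loss weights, momentum, and epoch boundaries come from the text; the function names, the source labels (`"teacher"`, `"fused"`, `"llm_soft"`), the assumption that curriculum stages are cumulative, and the assumption that regeneration fires on even epochs are illustrative choices.

```python
LAMBDA_PSEUDO = 1.0
LAMBDA_CONS = 0.2
EMA_MOMENTUM = 0.999
TEACHER_SCORE_THRESHOLD = 0.7  # p_t >= 0.7 for teacher pseudo-labels


def pseudo_label_sources(epoch):
    """Pseudo-label sources active at a 1-indexed epoch.

    Assumption: stages are cumulative (earlier sources stay active).
    """
    if epoch <= 2:
        return ["teacher"]
    if epoch <= 5:
        return ["teacher", "fused"]
    return ["teacher", "fused", "llm_soft"]


def total_loss(l_sup, l_pseudo, l_cons):
    """L = L_sup + lambda_pseudo * L_pseudo + lambda_cons * L_cons."""
    return l_sup + LAMBDA_PSEUDO * l_pseudo + LAMBDA_CONS * l_cons


def ema_update(teacher_params, student_params, momentum=EMA_MOMENTUM):
    """teacher <- m * teacher + (1 - m) * student, per parameter."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


def should_regenerate(epoch, every=2):
    """Assumption: pseudo-labels are regenerated after every `every`-th epoch."""
    return epoch % every == 0


schedule = {epoch: pseudo_label_sources(epoch) for epoch in range(1, 8)}
```

With a momentum of 0.999, each update moves the teacher only 0.1% of the way toward the student, which is what keeps the pseudo-label source stable between regenerations.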

Key Experimental Results

Metrics: COCO-style AP / AP75 / APS. Low-data settings randomly sample 5% or 10% labels.

Main Results

PubLayNet (5 classes, 5% labels) main results:

Category	Method	Labels	AP	Description
Supervised Upper Bound	Supervised (100%)	100%	91.4	Performance ceiling
Semi-supervised Baseline	Dense Teacher	5%+U	85.3	Strong semi-supervised baseline
Semi-supervised Baseline	STEP-DETR	5%+U	84.8	Transformer extension
Ours-Lightweight	Ours (SwiftFormer 26M, adaptive)	5%+U	88.2	+2.9 over Dense Teacher
Document Pre-trained	LayoutLMv3 + Semi-supervised	5%+U	89.1	Multimodal backbone
Document Pre-trained	UDOP	5%+U	89.8	Requires 100M+ page pre-training
Ours-Pre-trained	Ours + LayoutLMv3 (adaptive)	5%+U	89.7	Outperforms LayoutLMv3+SSL, matches UDOP
Zero-shot Control	GPT-4V (zero-shot)	0%	74.3	LLM cannot serve as direct detector

The lightweight backbone (26M parameters, no multimodal pre-training) reaches 88.2 AP, slightly above LayoutLMv3 fine-tuned on 5% labels (87.6 AP). By using LayoutLMv3 as a teacher, Ours reaches 89.7 AP, within 0.1 AP of UDOP (89.8 AP, which requires massive multimodal pre-training), while relying only on a text-pretrained LLM.

Ablation Study

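PubLayNet (5% labels) component-wise analysis. Δ is measured against the 82.3 AP baseline for the additive ("+") rows and against the 87.3 AP full model for the "w/o" rows: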

Configuration	AP	Δ	Description
Baseline	82.3	-	Detector only
+ Teacher	84.1	+1.8	Teacher pseudo-labels
+ LLM only	85.6	+3.3	LLM structural regions only
+ Fusion	86.7	+4.4	Inverse variance fusion
+ Cross-modal (Full)	87.3	+5.0	Complete model
w/o LLM	84.3	−3.0	Removing LLM causes largest drop
w/o Fusion	86.1	−1.2	Reverting to simple concatenation
w/o $L_{cons}$	86.7	−0.6	Removing consistency loss
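
The Δ column mixes two reference points, which a quick recomputation makes explicit. The snippet below only re-derives the differences from the AP values in the table; the variable names are illustrative.

```python
baseline_ap = 82.3  # "Baseline" row
full_ap = 87.3      # "+ Cross-modal (Full)" row

additive_rows = {
    "+ Teacher": 84.1,
    "+ LLM only": 85.6,
    "+ Fusion": 86.7,
    "+ Cross-modal (Full)": 87.3,
}
removal_rows = {
    "w/o LLM": 84.3,
    "w/o Fusion": 86.1,
    "w/o L_cons": 86.7,
}

# Additive rows are compared against the baseline.
gains = {name: round(ap - baseline_ap, 1) for name, ap in additive_rows.items()}
# Removal rows are compared against the full model.
drops = {name: round(ap - full_ap, 1) for name, ap in removal_rows.items()}

for name, delta in {**gains, **drops}.items():
    print(f"{name:22s} {delta:+.1f}")
```

All seven recomputed differences agree with the Δ values reported in the table.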

Key Findings

The LLM structural prior provides the most significant gain: removing it drops performance by 3.0 AP, confirming that semantic structure provides information unattainable through vision alone.
LLM gains are concentrated in rare classes (DocLayNet): Caption +8.4, Header +7.2, and Title +6.8 AP relative to the baseline, whereas common elements such as Text/Paragraph improve by only +2 to +3 AP.
Learnable adaptive gating consistently outperforms fixed weights: +0.9 AP for lightweight models and +0.3 AP for pre-trained models, with PAC bounds correctly predicting $O(\sqrt{k/n})$ convergence.
Inexpensive/open-source LLMs are viable: GPT-4o-mini costs only $12 per 50K pages, and Llama-3-70B provides nearly identical performance, supporting privacy-sensitive scenarios.

Highlights & Insights

The positioning of "LLM as prior generator, not detector" is critical—it bypasses the weak spatial localization of VLMs (GPT-4V zero-shot 74.3 AP) while utilizing their strength in textual reasoning.
Inverse variance fusion provides a principled answer to "whom to trust," while the learnable gate handles cases where Gaussian assumptions fail. The 64K parameter gate and the data-dependent PAC bound ($k\approx22\ll d$) explain successful learning under low-data regimes.
The concentration of LLM gains in rare layout classes suggests a generalizable strategy for long-tail detection: using textual/attribute descriptions to supplement pseudo-labels for sparse categories.

Limitations & Future Work

Strong dependency on OCR quality: The LLM path relies on Tesseract blocks; poor scan quality or complex layouts may contaminate structural inference.
Heuristic uncertainty estimation: The definitions for $Q_{text}$ and $Q_{spatial}$ in $\sigma^2_l$ are relatively coarse. The PAC bounds are also noted to be somewhat loose (predicting a 2.3 mAP gap vs. 0.7 mAP observed).
Limited evaluation scope: Experiments are restricted to PubLayNet/DocLayNet (5/11 classes). Generalization to table structure recognition or handwritten documents remains to be tested.
Complexity: The introduction of OCR+LLM adds an external dependency chain, increasing deployment complexity compared to vision-only semi-supervised methods.

vs. Dense Teacher / STEP-DETR (Visual Semi-supervised): These methods rely solely on visual cues and suffer from teacher bias in rare classes; Ours injects LLM semantic priors, yielding a +2.9 AP gain over Dense Teacher at 5% labels.
vs. UDOP / DocLLM (Unified Pre-training): These require massive multimodal corpora and compute; Ours uses text-only LLMs as plug-and-play priors, matching UDOP's performance with only 5% labels.
vs. GPT-4V Zero-shot: Direct VLM detection (74.3 AP) falls well short of the supervised upper bound (91.4 AP); the results indicate that "collaboration" (LLM prior + specialized detector) is superior to "replacement."

Rating

Novelty: ⭐⭐⭐⭐ The combination of LLM structural priors with inverse variance/gated fusion and PAC validation is a fresh entry into semi-supervised layout analysis.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Includes dual benchmarks, multiple backbones, per-class analysis, and extensive statistical/theoretical validation.
Writing Quality: ⭐⭐⭐⭐ Motivations are clear, though the theoretical sections are notation-heavy.
Value: ⭐⭐⭐⭐ Provides a practical "low-label + language prior" route that is cost-effective and friendly to rare classes.