ForCenNet: Foreground-Centric Network for Document Image Rectification¶
Conference: ICCV 2025 · arXiv: 2507.19804 · Code: https://github.com/caipeng328/ForCenNet
Area: Document Analysis / Image Rectification
Keywords: document image rectification, foreground guidance, curvature consistency loss, mask guidance, deformation field prediction
TL;DR¶
This paper proposes ForCenNet, a foreground-centric document rectification network featuring three key contributions: foreground label generation, a mask-guided Transformer decoder, and a curvature consistency loss. The method requires only undistorted images for training and achieves state-of-the-art performance on four benchmarks: DocUNet, DIR300, WarpDoc, and DocReal.
Background & Motivation¶
Document image rectification aims to remove geometric distortions from photographed documents to facilitate downstream tasks such as OCR. Existing methods face several critical challenges:
Foreground neglect: As shown in Figure 1, readable regions in documents (text, table lines) occupy only a small fraction of pixels, yet the most prominent distortions concentrate in the background. Existing methods (CGU-Net, DocRes, etc.) predict deformation fields uniformly over the entire image, leading to a misalignment between the task objective (foreground readability) and the optimization objective (per-pixel accuracy over the full image).
Incomplete foreground definition: DocGeoNet focuses on text-line masks, while FTDR uses frozen detection models to coarsely extract text-line information. However: (a) conventional detection models struggle to accurately recognize distorted text lines; (b) document foreground encompasses not only text but also table lines, figures, and other elements.
Scarcity of annotated data: Fine-grained annotations for document rectification are difficult to obtain. Weakly supervised methods circumvent this issue but sacrifice readability.
Core Idea: Detailed foreground elements (text, straight lines, figures) are extracted from undistorted images to construct a foreground-centric training framework that requires only undistorted reference images, enabling efficient iterative training.
Method¶
Overall Architecture¶
ForCenNet consists of three stages: (a) foreground label generation — foreground elements are extracted from undistorted images to produce training data; (b) foreground-centric network architecture — feature extraction + Transformer encoder + foreground segmentation + mask-guided decoder; (c) foreground-centric optimization — mask loss + deformation field regression + curvature consistency loss.
Key Designs¶
- Foreground Label Generation: Derived entirely from undistorted images, without manual annotation:
- Character-level foreground segmentation: Hi-SAM is fine-tuned to uniformly segment text regions, line elements, and figures, yielding the foreground mask \(M_u\).
- Line element extraction: An OCR engine extracts text lines (represented by the centerline of bounding boxes); a dedicated LSD-based document line detection algorithm (Algorithm 1) detects table lines, filtering non-horizontal/non-vertical lines and suppressing duplicate detections.
- Distortion field generation: Native backward mappings \(\mathcal{BM}\) are obtained from DOC3D, and the corresponding forward mappings \(\mathcal{FM}\) are computed. Distortion fields are augmented via random cropping and pairwise overlap; \(\mathcal{FM}\) is then applied to undistorted images, masks, and line elements to generate training samples.
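The angle filtering and duplicate suppression steps of the document line detector (Algorithm 1) can be sketched as below. This is a minimal pure-NumPy stand-in: the raw segments would come from an LSD detector, and the `angle_tol` and `merge_dist` thresholds are illustrative assumptions, not values from the paper.

```python
import numpy as np

def filter_document_lines(segments, angle_tol=5.0, merge_dist=4.0):
    """Keep near-horizontal/vertical segments and suppress duplicates.

    segments: (N, 4) array of (x1, y1, x2, y2) line endpoints,
    e.g. from an LSD detector. Thresholds here are illustrative.
    """
    segments = np.asarray(segments, dtype=float)
    dx = segments[:, 2] - segments[:, 0]
    dy = segments[:, 3] - segments[:, 1]
    angles = np.degrees(np.arctan2(dy, dx)) % 180.0

    # Keep only segments within angle_tol of horizontal (0°/180°) or vertical (90°).
    near_h = np.minimum(angles, 180.0 - angles) < angle_tol
    near_v = np.abs(angles - 90.0) < angle_tol
    kept = segments[near_h | near_v]

    # Greedy duplicate suppression: prefer longer segments, and drop any
    # segment whose midpoint lies within merge_dist of an accepted one.
    mids = (kept[:, :2] + kept[:, 2:]) / 2.0
    lengths = np.hypot(kept[:, 2] - kept[:, 0], kept[:, 3] - kept[:, 1])
    accepted = []
    for i in np.argsort(-lengths):
        if all(np.linalg.norm(mids[i] - mids[j]) >= merge_dist for j in accepted):
            accepted.append(i)
    return kept[sorted(accepted)]
```

On a document image, this keeps one clean candidate per table rule while discarding diagonal edges and overlapping detections.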
- Mask-Guided Transformer Decoder: Predicted foreground masks are used to guide feature extraction:
- Feature extraction module: large-kernel convolution + residual layers, outputting \(F_u \in \mathbb{R}^{H/8 \times W/8 \times 256}\).
- Efficient Transformer encoder: 3-layer vanilla Transformer with overlapping patch embedding (kernel=3, stride=2) and an SPW strategy to reduce attention complexity.
- Foreground segmentation module: a lightweight network predicts a binary mask \(M\), smoothed via softmax to obtain probability density \(\tilde{M}\).
- Mask-guided self-attention: \(\text{MSA}(Q,K,V) = \text{Softmax}(\frac{QK^T + \sigma \text{Seq}(\tilde{M})\text{Seq}(\tilde{M})^T}{\sqrt{d_{\text{head}}}})V\), directing attention toward foreground regions.
- Encoder–decoder cross-attention enables multi-scale information fusion.
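The mask-guided self-attention formula above can be sketched in single-head NumPy form. The bias term \(\text{Seq}(\tilde{M})\text{Seq}(\tilde{M})^T\) and the scale \(\sigma\) follow the equation; the tensor shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mask_guided_attention(Q, K, V, mask_prob, sigma=1.0):
    """Single-head mask-guided self-attention.

    Q, K, V: (L, d) token features; mask_prob: (L,) foreground
    probabilities (the flattened mask); sigma scales the mask bias.
    The outer product m @ m.T raises attention scores between pairs
    of likely-foreground tokens before the softmax.
    """
    d = Q.shape[-1]
    m = np.asarray(mask_prob, dtype=float).reshape(-1, 1)  # Seq(M): (L, 1)
    scores = (Q @ K.T + sigma * (m @ m.T)) / np.sqrt(d)
    return softmax(scores, axis=-1) @ V
```

When `mask_prob` is near 1 on foreground tokens and near 0 elsewhere, foreground-to-foreground attention weights are boosted relative to background tokens; with a zero mask the formula reduces to vanilla attention.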
- Curvature Consistency Loss: A geometry-aware loss designed for fine-grained elements such as table lines:
- Motivation: Table line pixels are sparse, making L1 supervision weak; moreover, L1 captures only pixel offsets without encoding geometric structure.
- Points are sampled every 4 pixels along line elements to form the set \(P\); bilinear interpolation projects them onto the predicted and ground-truth deformation fields to obtain control points \(Cp\) and \(Cp_{gt}\).
- Curvature is computed via central differences: \(\kappa_i = \frac{|x'_i y''_i - y'_i x''_i|}{(x'^2_i + y'^2_i)^{3/2} + \varepsilon}\)
- The predicted curvature is constrained to match the ground-truth curvature trend: \(\mathcal{L}_k = \frac{1}{N-1}\sum_{i=1}^{N-1}\left|\hat{\kappa}_i - \kappa_i\right|\)
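A minimal NumPy sketch of the curvature computation and the consistency penalty. Central differences are taken via `np.gradient`; the point layout in the usage below is illustrative, and the loss is written as a mean absolute curvature difference.

```python
import numpy as np

def curvature(points, eps=1e-8):
    """Discrete curvature of a polyline via central differences.

    points: (N, 2) array of (x, y) control points sampled along a line.
    Implements kappa_i = |x' y'' - y' x''| / ((x'^2 + y'^2)^{3/2} + eps).
    """
    x, y = points[:, 0], points[:, 1]
    dx, dy = np.gradient(x), np.gradient(y)        # first derivatives
    ddx, ddy = np.gradient(dx), np.gradient(dy)    # second derivatives
    return np.abs(dx * ddy - dy * ddx) / ((dx**2 + dy**2) ** 1.5 + eps)

def curvature_consistency_loss(pred_pts, gt_pts):
    """Mean absolute difference between predicted and GT curvature."""
    return np.mean(np.abs(curvature(pred_pts) - curvature(gt_pts)))
```

A straight line has zero curvature everywhere, so control points that land on a straight line incur no penalty against a straight ground-truth line, while any residual bending is penalized even where the pixel-wise L1 offset is tiny.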
Loss & Training¶
The total loss comprises three terms:
- \(\mathcal{L}_{seg}\): L1 loss on the foreground mask.
- \(\mathcal{L}_{map}\): L1 regression loss on the backward mapping, \(\|\hat{\mathcal{BM}} - \mathcal{BM}\|_1\).
- \(\mathcal{L}_k\): curvature consistency loss.
Training details:
- Input resolution: 288×288; AdamW optimizer; batch size 32.
- OneCycle learning rate schedule, maximum lr = 10⁻⁴, 10% warmup.
- 2× NVIDIA A100 GPUs; trained for 30 epochs.
- Two training variants: ForCenNet (365 undistorted images from DocUNet + DIR300) and ForCenNet-DOC3D (undistorted images from DOC3D).
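The stated optimizer and schedule map directly onto standard PyTorch components. A sketch of the setup, assuming the paper's reported hyperparameters (the `model` and step counts are placeholders):

```python
import torch

model = torch.nn.Linear(8, 8)   # placeholder for the ForCenNet model
epochs, steps_per_epoch = 30, 1000  # steps_per_epoch is an assumption

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# OneCycle schedule: peak lr 1e-4, 10% of steps spent warming up.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-4,
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    pct_start=0.1,
)
# scheduler.step() is called once per optimizer step, not per epoch.
```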
Key Experimental Results¶
Main Results¶
Comparison on the DocUNet benchmark:
| Type | Method | MS-SSIM↑ | LD↓ | AD↓ | ED↓ | CER↓ |
|---|---|---|---|---|---|---|
| Weakly-sup. | PaperEdge | 0.470 | 8.49 | 0.39 | 825.48 | 0.211 |
| Weakly-sup. | FDRNet | 0.542 | 8.21 | – | 829.78 | 0.206 |
| Foreground | DocGeoNet | 0.504 | 7.71 | 0.38 | 713.94 | 0.182 |
| Foreground | FTDR | 0.497 | 8.43 | 0.37 | 697.52 | 0.170 |
| Foreground | LA-DocFlatten | 0.526 | 6.72 | 0.30 | – | – |
| Other | CGU-Net | 0.557 | 6.83 | 0.31 | 513.76 | 0.178 |
| Foreground | ForCenNet | 0.582 | 4.82 | 0.19 | 571.40 | 0.136 |
On the DIR300 benchmark, ForCenNet achieves MS-SSIM=0.713 and LD=4.65, with ED dropping below 400 for the first time (390.61).
Cross-domain generalization (WarpDoc and DocReal, without additional fine-tuning):
| Method | WarpDoc MS-SSIM↑ | WarpDoc LD↓ | DocReal MS-SSIM↑ | DocReal LD↓ |
|---|---|---|---|---|
| DocTr | 0.39 | 27.01 | 0.550 | 12.60 |
| CGU-Net | 0.35 | 26.28 | 0.549 | 11.33 |
| DocRes | 0.50 | 12.86 | 0.550 | 11.52 |
| ForCenNet | 0.54 | 8.10 | 0.595 | 6.95 |
Ablation Study¶
Module ablation on the DocUNet benchmark:
| ID | Mask-Guided (MGD) | Curvature Loss (CL) | MS-SSIM↑ | LD↓ | CER↓ |
|---|---|---|---|---|---|
| D | ✗ | ✗ | 0.530 | 7.06 | 0.198 |
| A | ✗ | ✓ | 0.558 | 5.44 | 0.173 |
| B | ✓ | ✗ | 0.565 | 5.10 | 0.169 |
| C | ✓ | ✓ | 0.571 | 4.95 | 0.141 |
Data scale ablation: using 65 undistorted images with varying distortion field augmentation multipliers:
| Multiplier | MS-SSIM↑ | LD↓ | CER↓ |
|---|---|---|---|
| ×1 | 0.449 | 10.745 | 0.291 |
| ×100 | 0.530 | 5.348 | 0.208 |
| ×500 | 0.566 | 4.892 | 0.149 |
| ×1000 | 0.571 | 4.950 | 0.141 |
| ×2000 | 0.567 | 4.965 | 0.147 |
Performance saturates at approximately ×500–×1000.
Key Findings¶
- The mask-guided module (MGD) contributes more substantially (MS-SSIM: 0.530→0.565), while the curvature consistency loss yields significant CER improvement (0.169→0.141).
- A strong model can be trained using only 65 undistorted images combined with distortion field augmentation, demonstrating high label generation efficiency.
- Freezing the foreground segmentation model causes a sharp performance drop (MS-SSIM: 0.571→0.468), confirming the necessity of differentiable end-to-end training.
- ForCenNet's line rectification outperforms DocRes, yielding greater total line segment length on 65% of DocReal and 69% of WarpDoc samples.
- Robust cross-domain performance demonstrates the strong generalization capability of the foreground-guided strategy.
Highlights & Insights¶
- Precise problem framing: The paper identifies the fundamental contradiction in document rectification — foreground regions most in need of correction occupy the fewest pixels — and proposes a systematic solution.
- Elegant label generation: Full training data can be generated automatically from undistorted reference images combined with random distortion fields, substantially reducing annotation costs.
- Comprehensive foreground definition: Text, table lines, and figures are treated as a unified set of foreground elements, providing broader coverage than DocGeoNet or FTDR.
- Novel curvature consistency loss: Constraining the deformation of line elements from a geometric perspective captures structural information beyond what pure L1 loss can express.
- Differentiable foreground segmentation: Ablation experiments validate the necessity of end-to-end learning; freezing the segmentation module leads to a large performance degradation.
Limitations & Future Work¶
- Foreground segmentation relies on fine-tuned Hi-SAM, which may not be robust to rare characters or extreme distortions.
- The fixed input resolution of 288×288 may lose fine details for high-resolution documents.
- Distortion fields sourced from DOC3D may not cover all real-world distortion types.
- Data scale augmentation exhibits a saturation point; more diverse distortion field designs may yield further gains.
- Illumination correction (shadows, uneven lighting) is not addressed; joint optimization with illumination rectification is a natural extension.
- Foreground masks could be further refined (e.g., distinguishing text, tables, and figures) to enable differentiated loss weighting.
Related Work & Insights¶
- DocGeoNet and FTDR impose text-line foreground constraints; ForCenNet extends this to line elements and figures, replacing coarse detection-based constraints with mask guidance and a curvature loss.
- Weakly supervised approaches (PaperEdge, FDRNet, DRNet) avoid manual annotation at the cost of readability; ForCenNet's label generation strategy combines the advantages of both paradigms.
- The curvature consistency loss concept is generalizable to other tasks requiring geometric consistency, such as map rectification and architectural facade correction.
- The mask-guided attention design is applicable to other vision tasks that require region-level focus.
Rating¶
- Novelty: ⭐⭐⭐⭐ The foreground label generation pipeline and curvature consistency loss are novel contributions, though the overall framework (encoder–decoder + deformation field) follows established paradigms.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four real-world benchmarks, cross-domain evaluation, comprehensive ablations, rich visualizations, and detailed data-scale and structural ablations.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with complete mathematical derivations; occasional symbol rendering artifacts (LaTeX artifacts) slightly impede readability.
- Value: ⭐⭐⭐⭐ Delivers practical value to the document rectification community; the label generation scheme lowers the barrier for real-world deployment.