# GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-Guided Latent Diffusion Model?
- Conference: ICLR 2026
- arXiv: 2510.26339
- Code: Available (noted for release in the paper)
- Area: Diffusion Models
- Keywords: Image Super-Resolution, Scene Text Recovery, ControlNet, Diffusion Models, OCR
## TL;DR
This paper proposes GLYPH-SR, a vision-language-guided diffusion framework that simultaneously optimizes image quality and text readability via a dual-branch Text-SR fusion ControlNet and a ping-pong scheduler, achieving a 15.18-point improvement in OCR F1 on SVT ×8.
## Background & Motivation
Image super-resolution (SR) serves as a foundational technique for many vision systems; however, existing SR methods suffer from two systematic biases: (1) Metric bias — global metrics such as PSNR/SSIM assign negligible weight to small text regions (typically less than 1% of the image), so character corruption incurs almost no penalty; (2) Objective bias — commonly used training losses treat text as ordinary high-frequency texture rather than the discrete semantic units required by OCR. These biases give rise to two failure modes: hallucination (generating sharp but incorrect characters) and conservative recovery (retaining blur without improvement). The core problem is achieving visual realism and text readability simultaneously — two objectives that exhibit significant tension.
## Method

### Overall Architecture
GLYPH-SR builds upon a pretrained LDM (Juggernaut-XL) and augments it with a Text-SR fusion ControlNet (TS-ControlNet). OCR is used to extract text–position pairs that provide word-level semantic guidance, while a ping-pong scheduler alternates between text-centric and image-centric guidance throughout the denoising process.
### Key Designs
- **Condition Decomposition**
  - Function: Explicitly separates guidance signals into image-oriented and text-oriented components.
  - Mechanism: A scene-level caption \(\mathcal{S}_{\text{IMG}}\) summarizes global attributes (lighting, composition, etc.); an OCR module detects \(K\) text instances and returns position–text pairs \(\{(\mathcal{S}_{\text{text}}^k, \mathcal{S}_{\text{pos}}^k)\}_{k=1}^K\), which are converted into structured natural-language prompts (e.g., "HSBC appears at the center of the image").
  - Design Motivation: When guidance is provided only in aggregated form, small text regions are still treated as generic high-frequency texture.
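The conversion from OCR output to a structured prompt can be sketched as a small helper. The template below is hypothetical (the paper shows only the example clause, not the exact format), assuming each OCR hit arrives as a `(text, position)` pair:

```python
def build_text_prompt(ocr_pairs):
    """Render OCR (text, position) pairs as a structured natural-language
    prompt, one clause per detected text instance (hypothetical template)."""
    clauses = [f"{text} appears at the {pos} of the image"
               for text, pos in ocr_pairs]
    return "; ".join(clauses)


# e.g. build_text_prompt([("HSBC", "center")])
# -> "HSBC appears at the center of the image"
```

Each clause gives the text encoder an explicit word–location binding instead of burying the glyphs inside a single aggregate caption.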
- **Text-SR Fusion ControlNet (TS-ControlNet)**
  - Function: Balances image quality and text readability while preserving the generative prior.
  - Mechanism: A dual-branch architecture: the SR branch is frozen to maintain overall image quality, while the text branch is trainable and focuses on glyph recovery. Residual mixed injection is formulated as \(c = \frac{1}{2} s_{\text{ctrl}} \left[ \mathcal{C}_{\text{SR}}(z_t; \phi_{\text{img}}(\mathcal{S}_{\text{IMG}}+P)) + \mathcal{C}_{\text{TXT}}(z_t; \phi_{\text{txt}}(\mathcal{S}_{\text{TXT}}+P)) \right]\)
  - Design Motivation: Directly separating the two guidance signals improves text but degrades non-text regions; the frozen SR branch counteracts this.
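Numerically, the residual mixed injection is just an equal-weight average of the two branch outputs scaled by the control strength. A minimal NumPy sketch, with the branch networks stubbed out as precomputed residual tensors:

```python
import numpy as np


def mix_control_residuals(res_sr, res_txt, s_ctrl=1.0):
    """c = (1/2) * s_ctrl * (C_SR(...) + C_TXT(...)): equal-weight average of
    the frozen SR-branch and trainable text-branch residuals, scaled by the
    control strength s_ctrl. Branch forward passes are assumed precomputed."""
    return 0.5 * s_ctrl * (np.asarray(res_sr) + np.asarray(res_txt))
```

Because the combination is a fixed average rather than a learned gate, the frozen SR branch always contributes, which is what protects non-text regions while the text branch is fine-tuned.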
- **Ping-Pong Scheduler**
  - Function: Dynamically reweights text and image guidance along the denoising trajectory.
  - Mechanism: A time-dependent coefficient \(\lambda_t\) modulates both embedding fusion and residual injection. A binary square-wave strategy alternates between \(\lambda_t=0\) (text-centric) and \(\lambda_t=1\) (image-centric) with switching period \(\tau=1\): \(\lambda_t = 0\) if \(\lfloor \frac{t-t_0}{\tau} \rfloor \bmod 2 = 0\), otherwise \(\lambda_t = 1\).
  - Design Motivation: Continuous gradual transitions are less effective than a square wave; text-centric steps inject precise glyph cues, while image-centric steps stabilize global structure.
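The square-wave schedule is a one-liner; the sketch below assumes integer denoising steps `t` and a start step `t0`:

```python
def pingpong_lambda(t, t0=0, tau=1):
    """lambda_t = 0 (text-centric) when floor((t - t0) / tau) is even,
    lambda_t = 1 (image-centric) when it is odd; tau is the switch period."""
    return 0 if ((t - t0) // tau) % 2 == 0 else 1
```

With the paper's setting \(\tau=1\), consecutive denoising steps simply alternate between text-centric and image-centric guidance.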
### Loss & Training
- Standard \(\varepsilon\)-prediction objective: \(\mathcal{L}_{\text{text}} = \mathbb{E}_{z_0, t, \varepsilon} \| \varepsilon - \mathcal{D}_\theta(z_t, t, c) \|_2^2\)
- A four-partition synthetic corpus is constructed by independently perturbing glyph quality and global image quality, enabling targeted text recovery learning.
- The LDM backbone and SR branch are frozen; only the text branch is fine-tuned.
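The \(\varepsilon\)-prediction objective above is the standard noise-regression MSE. A minimal sketch with the denoiser's output stubbed as a precomputed prediction (mean-reduced, a common estimator of the squared-norm expectation):

```python
import numpy as np


def eps_prediction_loss(eps, eps_pred):
    """L = E || eps - D_theta(z_t, t, c) ||_2^2, estimated as the mean
    squared error between the sampled noise and the denoiser's prediction."""
    return float(np.mean((np.asarray(eps) - np.asarray(eps_pred)) ** 2))
```

During training, gradients from this loss flow only into the text branch, since the backbone and SR branch are frozen.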
## Key Experimental Results

### Main Results (SVT ×4)
| Method | OpenOCR F1 | GOT-OCR F1 | LLaVA-NeXT F1 | MANIQA | CLIP-IQA |
|---|---|---|---|---|---|
| DiffBIR | 38.73 | 42.33 | 45.19 | 47.82 | 58.66 |
| InvSR | 57.79 | 60.96 | 65.00 | 46.78 | 57.30 |
| PiSA-SR | 63.30 | 65.23 | 67.75 | 37.41 | 44.30 |
| GLYPH-SR | 67.54 | 71.72 | 73.22 | 47.75 | 59.40 |
### Ablation Study (Contribution of Core Components)
| Configuration | OCR F1 | MANIQA | Notes |
|---|---|---|---|
| Condition decomposition only | Improved | Degraded | Non-text regions deteriorate |
| + TS-ControlNet | Further improved | Maintained | Dual-branch balancing |
| + Ping-Pong | Best | Competitive | Square wave outperforms continuous gradients |
### Key Findings
- OCR F1 on SVT ×8 improves by up to 15.18 points over diffusion/GAN baselines.
- Validated across three datasets (SVT / SCUT-CTW1500 / CUTE80) at two scales (4× / 8×).
- OCR metrics improve substantially while MANIQA / CLIP-IQA / MUSIQ scores remain competitive.
## Highlights & Insights
- Scene text SR is explicitly formulated as a dual-objective optimization problem, establishing for the first time a standardized dual-axis evaluation protocol.
- The four-partition synthetic data design is elegant: orthogonal perturbation of glyph and image quality enables decoupled learning.
- The ping-pong scheduler is simple yet effective, outperforming more complex continuous noise-level scheduling strategies.
## Limitations & Future Work
- The pipeline depends on an OCR module to extract text locations, which may itself fail at low resolutions.
- Synthetic training data may not fully represent real-world degradation distributions.
- Evaluation is limited to 4× and 8× upscaling; performance at higher magnification factors remains unknown.
## Related Work & Insights
- vs. StableSR / DiffBIR: These methods optimize perceptual quality but are insensitive to character integrity.
- vs. TATT and other text SR methods: Text-specific SR methods perform poorly in full-scene settings due to oversimplified scene assumptions.
## Rating
- Novelty: ⭐⭐⭐⭐ The dual-objective SR framework and ping-pong scheduler constitute novel contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparisons across three datasets and two scales.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear and motivation is well-justified.
- Value: ⭐⭐⭐⭐ Practical applicability to scene text SR tasks.