# GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-Guided Latent Diffusion Model?
- Conference: ICLR 2026
- arXiv: 2510.26339
- Code: Available (noted for release in the paper)
- Area: Diffusion Models
- Keywords: Image Super-Resolution, Scene Text Recovery, ControlNet, Diffusion Models, OCR
## TL;DR
This paper proposes GLYPH-SR, a vision-language-guided diffusion framework that simultaneously optimizes image quality and text readability via a dual-branch Text-SR fusion ControlNet and a ping-pong scheduler, achieving a 15.18-point improvement in OCR F1 on SVT ×8.
## Background & Motivation
Image super-resolution (SR) serves as a foundational technique for many vision systems; however, existing SR methods suffer from two systematic biases: (1) Metric bias — global metrics such as PSNR/SSIM assign negligible weight to small text regions (typically less than 1% of the image), so character corruption incurs almost no penalty; (2) Objective bias — commonly used training losses treat text as ordinary high-frequency texture rather than the discrete semantic units required by OCR. These biases give rise to two failure modes: hallucination (generating sharp but incorrect characters) and conservative recovery (retaining blur without improvement). The core problem is achieving visual realism and text readability simultaneously — two objectives that exhibit significant tension.
## Method

### Overall Architecture
GLYPH-SR builds upon a pretrained LDM (Juggernaut-XL) and augments it with a Text-SR fusion ControlNet (TS-ControlNet). OCR is used to extract text–position pairs that provide word-level semantic guidance, while a ping-pong scheduler alternates between text-centric and image-centric guidance throughout the denoising process.
### Key Designs
- **Condition Decomposition**
  - Function: Explicitly separates guidance signals into image-oriented and text-oriented components.
  - Mechanism: A scene-level caption \(\mathcal{S}_{\text{IMG}}\) summarizes global attributes (lighting, composition, etc.); an OCR module detects \(K\) text instances and returns position–text pairs \(\{(\mathcal{S}_{\text{text}}^k, \mathcal{S}_{\text{pos}}^k)\}_{k=1}^K\), which are converted into structured natural-language prompts (e.g., "HSBC appears at the center of the image").
  - Design Motivation: When guidance is provided only in aggregated form, small text regions are still treated as generic high-frequency texture.
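The conversion from OCR output to a structured prompt can be sketched as a small helper. The template below is hypothetical (the paper shows only the example clause, not the exact format), assuming each OCR hit arrives as a `(text, position)` pair:

```python
def build_text_prompt(ocr_pairs):
    """Render OCR (text, position) pairs as a structured natural-language
    prompt, one clause per detected text instance (hypothetical template)."""
    clauses = [f"{text} appears at the {pos} of the image"
               for text, pos in ocr_pairs]
    return "; ".join(clauses)


# e.g. build_text_prompt([("HSBC", "center")])
# -> "HSBC appears at the center of the image"
```

Each clause gives the text encoder an explicit word–location binding instead of burying the glyphs inside a single aggregate caption.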
- **Text-SR Fusion ControlNet (TS-ControlNet)**
  - Function: Balances image quality and text readability while preserving the generative prior.
  - Mechanism: A dual-branch architecture: the SR branch is frozen to maintain overall image quality, while the text branch is trainable and focuses on glyph recovery. Residual mixed injection is formulated as \(c = \frac{1}{2} s_{\text{ctrl}} \left[ \mathcal{C}_{\text{SR}}(z_t; \phi_{\text{img}}(\mathcal{S}_{\text{IMG}}+P)) + \mathcal{C}_{\text{TXT}}(z_t; \phi_{\text{txt}}(\mathcal{S}_{\text{TXT}}+P)) \right]\)
  - Design Motivation: Directly separating the two guidance signals improves text but degrades non-text regions; the frozen SR branch counteracts this.
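Numerically, the residual mixed injection is just an equal-weight average of the two branch outputs scaled by the control strength. A minimal NumPy sketch, with the branch networks stubbed out as precomputed residual tensors:

```python
import numpy as np


def mix_control_residuals(res_sr, res_txt, s_ctrl=1.0):
    """c = (1/2) * s_ctrl * (C_SR(...) + C_TXT(...)): equal-weight average of
    the frozen SR-branch and trainable text-branch residuals, scaled by the
    control strength s_ctrl. Branch forward passes are assumed precomputed."""
    return 0.5 * s_ctrl * (np.asarray(res_sr) + np.asarray(res_txt))
```

Because the combination is a fixed average rather than a learned gate, the frozen SR branch always contributes, which is what protects non-text regions while the text branch is fine-tuned.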
- **Ping-Pong Scheduler**
  - Function: Dynamically reweights text and image guidance along the denoising trajectory.
  - Mechanism: A time-dependent coefficient \(\lambda_t\) modulates both embedding fusion and residual injection. A binary square-wave strategy alternates between \(\lambda_t=0\) (text-centric) and \(\lambda_t=1\) (image-centric) with switching period \(\tau=1\): \(\lambda_t = 0\) if \(\lfloor \frac{t-t_0}{\tau} \rfloor \bmod 2 = 0\), otherwise \(\lambda_t = 1\).
  - Design Motivation: Continuous gradual transitions are less effective than a square wave; text-centric steps inject precise glyph cues, while image-centric steps stabilize global structure.
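The square-wave schedule is a one-liner; the sketch below assumes integer denoising steps `t` and a start step `t0`:

```python
def pingpong_lambda(t, t0=0, tau=1):
    """lambda_t = 0 (text-centric) when floor((t - t0) / tau) is even,
    lambda_t = 1 (image-centric) when it is odd; tau is the switch period."""
    return 0 if ((t - t0) // tau) % 2 == 0 else 1
```

With the paper's setting \(\tau=1\), consecutive denoising steps simply alternate between text-centric and image-centric guidance.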
### Loss & Training
- Standard \(\varepsilon\)-prediction objective: \(\mathcal{L}_{\text{text}} = \mathbb{E}_{z_0, t, \varepsilon} \| \varepsilon - \mathcal{D}_\theta(z_t, t, c) \|_2^2\)
- A four-partition synthetic corpus is constructed by independently perturbing glyph quality and global image quality, enabling targeted text recovery learning.
- The LDM backbone and SR branch are frozen; only the text branch is fine-tuned.
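The \(\varepsilon\)-prediction objective above is the standard noise-regression MSE. A minimal sketch with the denoiser's output stubbed as a precomputed prediction (mean-reduced, a common estimator of the squared-norm expectation):

```python
import numpy as np


def eps_prediction_loss(eps, eps_pred):
    """L = E || eps - D_theta(z_t, t, c) ||_2^2, estimated as the mean
    squared error between the sampled noise and the denoiser's prediction."""
    return float(np.mean((np.asarray(eps) - np.asarray(eps_pred)) ** 2))
```

During training, gradients from this loss flow only into the text branch, since the backbone and SR branch are frozen.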
## Key Experimental Results

### Main Results (SVT ×4)
| Method | OpenOCR F1 | GOT-OCR F1 | LLaVA-NeXT F1 | MANIQA | CLIP-IQA |
|---|---|---|---|---|---|
| DiffBIR | 38.73 | 42.33 | 45.19 | 47.82 | 58.66 |
| InvSR | 57.79 | 60.96 | 65.00 | 46.78 | 57.30 |
| PiSA-SR | 63.30 | 65.23 | 67.75 | 37.41 | 44.30 |
| GLYPH-SR | 67.54 | 71.72 | 73.22 | 47.75 | 59.40 |
### Ablation Study (Contribution of Core Components)
| Configuration | OCR F1 | MANIQA | Notes |
|---|---|---|---|
| Condition decomposition only | Improved | Degraded | Non-text regions deteriorate |
| + TS-ControlNet | Further improved | Maintained | Dual-branch balancing |
| + Ping-Pong | Best | Competitive | Square wave outperforms continuous gradients |
### Key Findings
- OCR F1 on SVT ×8 improves by up to 15.18 points over diffusion/GAN baselines.
- Validated across three datasets (SVT / SCUT-CTW1500 / CUTE80) at two scales (4× / 8×).
- OCR metrics improve substantially while MANIQA / CLIP-IQA / MUSIQ scores remain competitive.
## Highlights & Insights
- Scene text SR is explicitly formulated as a dual-objective optimization problem, establishing for the first time a standardized dual-axis evaluation protocol.
- The four-partition synthetic data design is elegant: orthogonal perturbation of glyph and image quality enables decoupled learning.
- The ping-pong scheduler is simple yet effective, outperforming more complex continuous noise-level scheduling strategies.
## Limitations & Future Work
- The pipeline depends on an OCR module to extract text locations, which may itself fail at low resolutions.
- Synthetic training data may not fully represent real-world degradation distributions.
- Evaluation is limited to 4× and 8× upscaling; performance at higher magnification factors remains unknown.
## Related Work & Insights
- vs. StableSR / DiffBIR: These methods optimize perceptual quality but are insensitive to character integrity.
- vs. TATT and other text SR methods: Text-specific SR methods perform poorly in full-scene settings due to oversimplified scene assumptions.
## Rating
- Novelty: ⭐⭐⭐⭐ The dual-objective SR framework and ping-pong scheduler constitute novel contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparisons across three datasets and two scales.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear and motivation is well-justified.
- Value: ⭐⭐⭐⭐ Practical applicability to scene text SR tasks.