RL-ScanIQA: Reinforcement-Learned Scanpaths for Blind 360° Image Quality Assessment
Conference: CVPR 2026 · arXiv: 2603.14297 · Code: None · Area: Model Compression
Keywords: 360° image quality, reinforcement learning, scanpath, blind IQA, PPO, active perception
TL;DR
This paper proposes RL-ScanIQA, the first end-to-end reinforcement learning-based blind 360° image quality assessment (BIQA) framework. The core idea is to formulate scanpath generation as a sequential decision-making process, using a PPO policy to learn task-driven viewing strategies directly from quality assessment feedback, rather than relying on imitation learning from human gaze data. The framework consists of two jointly optimized modules—a scanpath generator and a quality assessor—augmented with multi-level rewards (step-level exploration, set-level diversity, and task-aligned perception) and distortion-space data augmentation. The method achieves state-of-the-art performance and strong cross-dataset generalization on three benchmarks: CVIQD, OIQA, and JUFE.
Background & Motivation
- Viewport constraints in 360° imagery: Panoramic images in immersive environments can only be experienced progressively through limited viewports; quality perception depends on the viewing trajectory rather than the full image.
- Decoupled scanpath and quality assessment in prior methods: Existing scanpath methods treat path generation as an independent preprocessing step, precluding end-to-end optimization and misaligning paths with IQA objectives.
- Dependence on human gaze data: Prior methods require human eye-tracking data for supervision, which is costly to collect and may bias toward salient content rather than quality-relevant regions.
- ERP projection distortion: Directly analyzing equirectangular projections introduces spatial bias and ignores spherical geometry.
- Limitations of fixed sampling strategies: Methods based on predefined viewports ignore the sequential and content-adaptive nature of user exploration.
- Poor cross-dataset generalization: Large variation in distortion types across datasets causes fixed-strategy methods to degrade sharply in cross-domain scenarios.
Method
Scanpath Generator (PPO Policy Network)
The sphere is discretized into \(8 \times 4 = 32\) candidate viewports (\(90° \times 90°\) FOV), and viewing is modeled as a finite-horizon MDP:
- State: \(s_t = [h_{t-1}; g]\), where \(h_{t-1}\) is the GRU history hidden state and \(g\) is the global image descriptor extracted by DINOv2.
- Action: attention scoring over candidate viewport features, followed by Softmax selection of the next viewport.
- Optimization: PPO with a clipped objective, GAE advantage estimation, and entropy regularization.
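The policy step described above can be sketched as scaled dot-product attention over the 32 candidates followed by softmax sampling. This is a minimal NumPy illustration: all feature dimensions and weight matrices are invented for the example and are not the paper's actual network sizes.

```python
import numpy as np

def select_viewport(h_prev, g, viewport_feats, W_q, W_k, rng):
    """One policy step: build the state s_t = [h_{t-1}; g], score the 32
    candidate viewports by scaled dot-product attention, and sample the
    next viewport from the resulting softmax distribution."""
    state = np.concatenate([h_prev, g])
    q = W_q @ state                               # query from the state
    keys = viewport_feats @ W_k.T                 # keys from the candidates
    logits = keys @ q / np.sqrt(len(q))           # attention scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax over 32 actions
    return int(rng.choice(len(probs), p=probs)), probs

rng = np.random.default_rng(0)
h_prev = rng.standard_normal(64)                  # GRU history hidden state
g = rng.standard_normal(128)                      # global descriptor (dim illustrative)
viewport_feats = rng.standard_normal((32, 128))   # 8 x 4 candidate viewports
W_q = rng.standard_normal((32, 192)) * 0.05       # 192 = 64 + 128
W_k = rng.standard_normal((32, 128)) * 0.05
action, probs = select_viewport(h_prev, g, viewport_feats, W_q, W_k, rng)
```

Sampling from the softmax (rather than taking the argmax) is what lets PPO's entropy regularization keep the policy exploratory during training.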
Multi-Level Reward Design
A. Step-Level Exploration Reward:
\(r_t = \lambda_{\text{ent}} \cdot \mathcal{H}(x_t) + \lambda_{\text{ssim}} \cdot (1-\text{SSIM}) + \lambda_{\text{nov}} \cdot \delta_{\text{new}} + \lambda_{\text{eqb}} \cdot \mathcal{B}(x_t)\)
Information entropy encourages attention to texture-rich regions; SSIM dissimilarity promotes diverse exploration; a novelty signal prevents revisiting; an equatorial-bias prior mimics human fixation habits.
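A minimal sketch of this step-level reward, assuming equal weights, a single-window SSIM, and a Gaussian latitude prior for the equatorial bias; all of these are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def ssim_global(x, y, C1=0.01**2, C2=0.03**2):
    """Single-window SSIM between two images with values in [0, 1]."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx**2 + my**2 + C1) * (x.var() + y.var() + C2))

def step_reward(patch, prev_patch, visited, vp_index, lat_deg,
                lam=(0.25, 0.25, 0.25, 0.25)):
    """r_t = lam_ent*H(x_t) + lam_ssim*(1-SSIM) + lam_nov*delta_new + lam_eqb*B(x_t)."""
    hist, _ = np.histogram(patch, bins=32, range=(0.0, 1.0))
    p = hist / hist.sum()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))    # texture richness H(x_t)
    dissim = 1.0 - ssim_global(patch, prev_patch)      # exploration vs. last viewport
    novelty = 1.0 if vp_index not in visited else 0.0  # delta_new: unvisited viewport
    eq_bias = np.exp(-(lat_deg / 30.0) ** 2)           # Gaussian equatorial prior
    return lam[0]*entropy + lam[1]*dissim + lam[2]*novelty + lam[3]*eq_bias

rng = np.random.default_rng(1)
patch, prev = rng.random((64, 64)), rng.random((64, 64))
r_new = step_reward(patch, prev, visited={0, 3}, vp_index=5, lat_deg=0.0)
r_seen = step_reward(patch, prev, visited={0, 5}, vp_index=5, lat_deg=0.0)
```

Revisiting a viewport (`r_seen`) forfeits exactly the novelty term, which is the dense signal that discourages loops in the path.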
B. Scanpath Diversity Reward:
\(\mathcal{R}_{\text{div}} = \beta_{\text{cov}} \cdot \frac{|\cup_k S_k|}{X} - \beta_{\text{jac}} \cdot \text{mean Jaccard similarity}\)
Encourages the \(K\) paths to cover a larger spherical area while penalizing inter-path overlap.
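Treating each path as the set of viewport indices it visits, and reading \(X\) as the number of candidate viewports (an interpretation, not stated explicitly here), the set-level reward can be sketched directly from the formula; the β weights are illustrative.

```python
def diversity_reward(paths, num_viewports=32, beta_cov=1.0, beta_jac=1.0):
    """R_div = beta_cov * |union of S_k| / X - beta_jac * mean pairwise Jaccard,
    with X taken as the number of candidate viewports (32 here)."""
    sets = [set(p) for p in paths]
    coverage = len(set.union(*sets)) / num_viewports   # fraction of sphere covered
    pair_jac = [len(a & b) / len(a | b)
                for i, a in enumerate(sets) for b in sets[i + 1:]]
    mean_jac = sum(pair_jac) / len(pair_jac) if pair_jac else 0.0
    return beta_cov * coverage - beta_jac * mean_jac

spread = diversity_reward([[0, 1, 2], [3, 4, 5], [6, 7, 8]])    # disjoint paths
overlap = diversity_reward([[0, 1, 2], [0, 1, 2], [0, 1, 2]])   # identical paths
```

Identical paths score negatively (small coverage, Jaccard of 1), so the reward pushes the \(K\) rollouts apart.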
C. Task-Aligned Perceptual Reward:
Negative MSE reward \(\mathcal{R}_{\text{mse}}\) plus a ranking reward \(\mathcal{R}_{\text{rank}}\); feedback is derived directly from the IQA prediction error, aligning path generation with the quality-prediction objective.
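One way to sketch this task-aligned reward over a mini-batch is negative squared error plus a pairwise rank-agreement bonus; the trade-off weight `gamma` and the exact combination are assumptions for illustration.

```python
import numpy as np

def task_reward(preds, mos, gamma=0.5):
    """R_mse = -(pred - MOS)^2 averaged over the batch, plus gamma times the
    fraction of image pairs whose predicted order matches the MOS order."""
    preds, mos = np.asarray(preds, float), np.asarray(mos, float)
    r_mse = -((preds - mos) ** 2).mean()          # negative MSE reward
    correct, total = 0, 0
    for i in range(len(mos)):
        for j in range(i + 1, len(mos)):
            if mos[i] != mos[j]:
                total += 1
                correct += (preds[i] - preds[j]) * (mos[i] - mos[j]) > 0
    r_rank = correct / total if total else 0.0    # pairwise ranking reward
    return r_mse + gamma * r_rank
```

Perfect predictions earn the full ranking bonus with zero MSE penalty; badly mis-ordered predictions are penalized on both terms.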
Quality Assessor
- Attention-weighted aggregation of viewport features: \(\alpha_t\) is computed via interaction between local feature \(f_t\) and global feature \(g\).
- The aggregated representation is concatenated with the global feature and passed through an MLP to regress the quality score.
- Final score is averaged over predictions from \(K\) paths.
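The aggregation steps above can be sketched as follows; a single linear layer stands in for the MLP, and all feature dimensions and weights are illustrative rather than the paper's.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_quality(viewport_feats, g, W_att, mlp_w, mlp_b):
    """Attention-weighted aggregation: alpha_t comes from the interaction of
    each local feature f_t with the global feature g; the aggregate is
    concatenated with g and regressed to a scalar quality score."""
    scores = viewport_feats @ (W_att @ g)   # f_t interacting with g
    alpha = softmax(scores)                 # attention weights over the path
    agg = alpha @ viewport_feats            # weighted sum of local features
    z = np.concatenate([agg, g])            # [aggregated; global]
    return float(z @ mlp_w + mlp_b)         # linear layer stands in for the MLP

# Final prediction averages the score over K paths.
rng = np.random.default_rng(0)
T, d, K = 7, 16, 3
g = rng.standard_normal(d)
W_att = rng.standard_normal((d, d)) * 0.1
mlp_w = rng.standard_normal(2 * d) * 0.1
scores_k = [predict_quality(rng.standard_normal((T, d)), g, W_att, mlp_w, 0.0)
            for _ in range(K)]
final_score = sum(scores_k) / K
```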
Cross-Domain Augmentation
- Consistency loss: Predictions after weak augmentation should remain stable.
- Triplet loss: Score ordering constraint among clean / mildly distorted / heavily distorted images.
- Cross-ranking loss: Relative quality relationships between image pairs are preserved after augmentation.
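The three constraints above can be sketched as simple squared-error and hinge losses; the margin values are illustrative assumptions, not the paper's hyperparameters.

```python
def consistency_loss(q, q_aug):
    """Predictions should be stable under weak augmentation."""
    return (q - q_aug) ** 2

def triplet_loss(q_clean, q_mild, q_heavy, margin=0.5):
    """Hinge on the ordering q_clean > q_mild > q_heavy by a margin."""
    return (max(0.0, margin - (q_clean - q_mild)) +
            max(0.0, margin - (q_mild - q_heavy)))

def cross_ranking_loss(q_a, q_b, q_a_aug, q_b_aug, margin=0.1):
    """If image A outscored image B before augmentation, the augmented
    pair should preserve that ordering (hinge with a small margin)."""
    sign = 1.0 if q_a >= q_b else -1.0
    return max(0.0, margin - sign * (q_a_aug - q_b_aug))
```

All three losses vanish when the predicted scores already respect the expected ordering and stability, so they only act on violations.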
Key Experimental Results
Table 1: In-Dataset Evaluation (SRCC / PLCC)
| Method | JUFE | OIQA | CVIQD |
|---|---|---|---|
| NIQE (handcrafted) | 0.552 / 0.592 | 0.745 / 0.736 | 0.893 / 0.872 |
| MC360IQA | 0.502 / 0.623 | 0.875 / 0.906 | 0.877 / 0.892 |
| Assessor360 | 0.489 / 0.510 | 0.979 / 0.945 | 0.958 / 0.963 |
| GSR-X | 0.843 / 0.857 | 0.922 / 0.937 | 0.805 / 0.957 |
| Q-Insight (LLM) | 0.557 / 0.412 | 0.643 / 0.795 | 0.872 / 0.801 |
| RL-ScanIQA | 0.816 / 0.902 | 0.941 / 0.967 | 0.970 / 0.970 |
RL-ScanIQA achieves the highest PLCC on all datasets, and also the best SRCC on CVIQD. On JUFE, PLCC substantially outperforms the runner-up (0.902 vs. 0.857), demonstrating the advantage of the RL policy under real-world distortion distributions.
Table 2: Cross-Dataset Evaluation (SRCC / PLCC)
| Method | CVIQD → OIQA | CVIQD → JUFE | JUFE → CVIQD | JUFE → OIQA |
|---|---|---|---|---|
| Assessor360 | 0.853 / 0.632 | 0.887 / 0.749 | 0.617 / 0.724 | 0.405 / 0.499 |
| GSR-X | 0.804 / 0.765 | 0.831 / 0.694 | 0.782 / 0.732 | 0.733 / 0.611 |
| F-VQA(A) | 0.772 / 0.621 | 0.604 / 0.509 | 0.665 / 0.679 | 0.683 / 0.732 |
| RL-ScanIQA | 0.901 / 0.800 | 0.913 / 0.822 | 0.771 / 0.755 | 0.802 / 0.833 |
Cross-dataset generalization surpasses all baselines in nearly every setting (GSR-X is marginally higher only on SRCC for JUFE → CVIQD), validating the effectiveness of distortion augmentation and the ranking-consistency constraints.
Highlights & Insights
- First end-to-end RL-based 360° IQA framework: Jointly optimizes scanpath generation and quality assessment without requiring human eye-tracking data.
- Sophisticated multi-level reward design: From step-level to set-level to task-level, sparse IQA supervision is transformed into dense shaping signals.
- Counterintuitive yet valuable finding: Real human fixation trajectories perform worse than RL-learned paths (Table 3: SRCC 0.724 → 0.816), suggesting that humans fixate on salient rather than quality-critical regions.
- Strong cross-domain generalization: Distortion-space augmentation combined with ranking consistency losses enables effective transfer across diverse distortion types.
- Intuitive and convincing visualizations: Paths for high-quality images cover the sphere uniformly, while paths for low-quality images concentrate on distorted regions.
Limitations & Future Work
- High computational overhead: Inference requires \(K=15\) paths \(\times\) \(T=7\) steps \(= 105\) viewport feature extractions, limiting real-time applicability.
- Coarse viewport discretization: 32 candidate viewports may be insufficient to precisely localize fine-grained distortion regions.
- Limited benchmark coverage: Only three datasets are evaluated; panoramic IQA datasets are small in scale, with CVIQD and OIQA each containing only a few hundred images.
- Frozen DINOv2 feature extractor: A frozen pretrained model may not be the most distortion-sensitive feature extraction solution.
- Dependence on MOS annotations: Training still requires precise human subjective quality scores, incurring non-trivial annotation costs.
- Heavy reward hyperparameter burden: Four step-level weights, two diversity weights, two task-alignment weights, and five loss function weights yield a heavy tuning burden.
Related Work & Insights
- 2D BIQA: BRISQUE (natural scene statistics), DBCNN, TreS, MANIQA (Transformer), Q-Insight (multimodal RL + LLM).
- 360° BIQA: MC360IQA (multi-branch CNN with fixed viewports), VGCN (graph convolution over viewport relations), Assessor360 / GSR-X / F-VQA (scanpath modeling with decoupled training).
- RL for visual tasks: Viewpoint planning, video summarization, attention selection; PPO combined with variance reduction and value guidance demonstrates robustness under sparse rewards.
- 360° visual exploration: Eye-tracking studies reveal human viewing characteristics such as equatorial bias and salient object preference.
Rating
| Dimension | Score (1–10) | Notes |
|---|---|---|
| Novelty | 8 | First end-to-end RL integration into 360° IQA; the joint path + assessment optimization paradigm is novel. |
| Technical Contribution | 8 | Multi-level reward design is well-motivated; cross-domain augmentation strategy is effective. |
| Experimental Thoroughness | 7 | Three datasets and complete ablations, but dataset scale is limited. |
| Writing Quality | 8 | Clear structure, rich figures and tables, comprehensive comparisons. |
| Value | 7 | Demand for 360° IQA is growing, but inference overhead and hyperparameter count may hinder deployment. |
| Overall | 7.6 | An excellent work introducing active perception into 360° quality assessment; the end-to-end RL paradigm offers meaningful inspiration. |