RL-ScanIQA: Reinforcement-Learned Scanpaths for Blind 360° Image Quality Assessment
Conference: CVPR 2026 · arXiv: 2603.14297 · Code: None · Area: Model Compression
Keywords: 360° image quality, reinforcement learning, scanpath, blind IQA, PPO, active perception
TL;DR
This paper proposes RL-ScanIQA, the first end-to-end reinforcement learning-based blind 360° image quality assessment (BIQA) framework. The core idea is to formulate scanpath generation as a sequential decision-making process, using a PPO policy to learn task-driven viewing strategies directly from quality assessment feedback, rather than relying on imitation learning from human gaze data. The framework consists of two jointly optimized modules—a scanpath generator and a quality assessor—augmented with multi-level rewards (step-level exploration, set-level diversity, and task-aligned perception) and distortion-space data augmentation. The method achieves state-of-the-art performance and strong cross-dataset generalization on three benchmarks: CVIQD, OIQA, and JUFE.
Background & Motivation
- Viewport constraints in 360° imagery: Panoramic images in immersive environments can only be experienced progressively through limited viewports; quality perception depends on the viewing trajectory rather than the full image.
- Decoupled scanpath and quality assessment in prior methods: Existing scanpath methods treat path generation as an independent preprocessing step, precluding end-to-end optimization and misaligning paths with IQA objectives.
- Dependence on human gaze data: Prior methods require human eye-tracking data for supervision, which is costly to collect and may bias toward salient content rather than quality-relevant regions.
- ERP projection distortion: Directly analyzing equirectangular projections introduces spatial bias and ignores spherical geometry.
- Limitations of fixed sampling strategies: Methods based on predefined viewports ignore the sequential and content-adaptive nature of user exploration.
- Poor cross-dataset generalization: Large variation in distortion types across datasets causes fixed-strategy methods to degrade sharply in cross-domain scenarios.
Method
Scanpath Generator (PPO Policy Network)
The sphere is discretized into \(8 \times 4 = 32\) candidate viewports (\(90° \times 90°\) FOV), and viewing is modeled as a finite-horizon MDP:
- State: \(s_t = [h_{t-1}; g]\), where \(h_{t-1}\) is the GRU history hidden state and \(g\) is the global image descriptor extracted by DINOv2.
- Action: attention scoring over candidate viewport features, followed by Softmax selection of the next viewport.
- Optimization: PPO with a clipped objective, GAE advantage estimation, and entropy regularization.
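The policy step described above can be sketched as scaled dot-product attention over the 32 candidates followed by softmax sampling. This is a minimal NumPy illustration: all feature dimensions and weight matrices are invented for the example and are not the paper's actual network sizes.

```python
import numpy as np

def select_viewport(h_prev, g, viewport_feats, W_q, W_k, rng):
    """One policy step: build the state s_t = [h_{t-1}; g], score the 32
    candidate viewports by scaled dot-product attention, and sample the
    next viewport from the resulting softmax distribution."""
    state = np.concatenate([h_prev, g])
    q = W_q @ state                               # query from the state
    keys = viewport_feats @ W_k.T                 # keys from the candidates
    logits = keys @ q / np.sqrt(len(q))           # attention scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax over 32 actions
    return int(rng.choice(len(probs), p=probs)), probs

rng = np.random.default_rng(0)
h_prev = rng.standard_normal(64)                  # GRU history hidden state
g = rng.standard_normal(128)                      # global descriptor (dim illustrative)
viewport_feats = rng.standard_normal((32, 128))   # 8 x 4 candidate viewports
W_q = rng.standard_normal((32, 192)) * 0.05       # 192 = 64 + 128
W_k = rng.standard_normal((32, 128)) * 0.05
action, probs = select_viewport(h_prev, g, viewport_feats, W_q, W_k, rng)
```

Sampling from the softmax (rather than taking the argmax) is what lets PPO's entropy regularization keep the policy exploratory during training.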
Multi-Level Reward Design
A. Step-Level Exploration Reward:
\(r_t = \lambda_{\text{ent}} \cdot \mathcal{H}(x_t) + \lambda_{\text{ssim}} \cdot (1-\text{SSIM}) + \lambda_{\text{nov}} \cdot \delta_{\text{new}} + \lambda_{\text{eqb}} \cdot \mathcal{B}(x_t)\)
Information entropy encourages attention to texture-rich regions; SSIM dissimilarity promotes diverse exploration; a novelty signal prevents revisiting; an equatorial-bias prior mimics human fixation habits.
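A minimal sketch of this step-level reward, assuming equal weights, a single-window SSIM, and a Gaussian latitude prior for the equatorial bias; all of these are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def ssim_global(x, y, C1=0.01**2, C2=0.03**2):
    """Single-window SSIM between two images with values in [0, 1]."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx**2 + my**2 + C1) * (x.var() + y.var() + C2))

def step_reward(patch, prev_patch, visited, vp_index, lat_deg,
                lam=(0.25, 0.25, 0.25, 0.25)):
    """r_t = lam_ent*H(x_t) + lam_ssim*(1-SSIM) + lam_nov*delta_new + lam_eqb*B(x_t)."""
    hist, _ = np.histogram(patch, bins=32, range=(0.0, 1.0))
    p = hist / hist.sum()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))    # texture richness H(x_t)
    dissim = 1.0 - ssim_global(patch, prev_patch)      # exploration vs. last viewport
    novelty = 1.0 if vp_index not in visited else 0.0  # delta_new: unvisited viewport
    eq_bias = np.exp(-(lat_deg / 30.0) ** 2)           # Gaussian equatorial prior
    return lam[0]*entropy + lam[1]*dissim + lam[2]*novelty + lam[3]*eq_bias

rng = np.random.default_rng(1)
patch, prev = rng.random((64, 64)), rng.random((64, 64))
r_new = step_reward(patch, prev, visited={0, 3}, vp_index=5, lat_deg=0.0)
r_seen = step_reward(patch, prev, visited={0, 5}, vp_index=5, lat_deg=0.0)
```

Revisiting a viewport (`r_seen`) forfeits exactly the novelty term, which is the dense signal that discourages loops in the path.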
B. Scanpath Diversity Reward:
\(\mathcal{R}_{\text{div}} = \beta_{\text{cov}} \cdot \frac{|\cup_k S_k|}{X} - \beta_{\text{jac}} \cdot \text{mean Jaccard similarity}\)
Encourages the \(K\) paths to cover a larger spherical area while penalizing inter-path overlap.
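Treating each path as the set of viewport indices it visits, and reading \(X\) as the number of candidate viewports (an interpretation, not stated explicitly here), the set-level reward can be sketched directly from the formula; the β weights are illustrative.

```python
def diversity_reward(paths, num_viewports=32, beta_cov=1.0, beta_jac=1.0):
    """R_div = beta_cov * |union of S_k| / X - beta_jac * mean pairwise Jaccard,
    with X taken as the number of candidate viewports (32 here)."""
    sets = [set(p) for p in paths]
    coverage = len(set.union(*sets)) / num_viewports   # fraction of sphere covered
    pair_jac = [len(a & b) / len(a | b)
                for i, a in enumerate(sets) for b in sets[i + 1:]]
    mean_jac = sum(pair_jac) / len(pair_jac) if pair_jac else 0.0
    return beta_cov * coverage - beta_jac * mean_jac

spread = diversity_reward([[0, 1, 2], [3, 4, 5], [6, 7, 8]])    # disjoint paths
overlap = diversity_reward([[0, 1, 2], [0, 1, 2], [0, 1, 2]])   # identical paths
```

Identical paths score negatively (small coverage, Jaccard of 1), so the reward pushes the \(K\) rollouts apart.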
C. Task-Aligned Perceptual Reward:
Negative MSE reward \(\mathcal{R}_{\text{mse}}\) plus a ranking reward \(\mathcal{R}_{\text{rank}}\); feedback is derived directly from the IQA prediction error, aligning path generation with the quality-prediction objective.
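One way to sketch this task-aligned reward over a mini-batch is negative squared error plus a pairwise rank-agreement bonus; the trade-off weight `gamma` and the exact combination are assumptions for illustration.

```python
import numpy as np

def task_reward(preds, mos, gamma=0.5):
    """R_mse = -(pred - MOS)^2 averaged over the batch, plus gamma times the
    fraction of image pairs whose predicted order matches the MOS order."""
    preds, mos = np.asarray(preds, float), np.asarray(mos, float)
    r_mse = -((preds - mos) ** 2).mean()          # negative MSE reward
    correct, total = 0, 0
    for i in range(len(mos)):
        for j in range(i + 1, len(mos)):
            if mos[i] != mos[j]:
                total += 1
                correct += (preds[i] - preds[j]) * (mos[i] - mos[j]) > 0
    r_rank = correct / total if total else 0.0    # pairwise ranking reward
    return r_mse + gamma * r_rank
```

Perfect predictions earn the full ranking bonus with zero MSE penalty; badly mis-ordered predictions are penalized on both terms.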
Quality Assessor
- Attention-weighted aggregation of viewport features: \(\alpha_t\) is computed via interaction between local feature \(f_t\) and global feature \(g\).
- The aggregated representation is concatenated with the global feature and passed through an MLP to regress the quality score.
- Final score is averaged over predictions from \(K\) paths.
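The aggregation steps above can be sketched as follows; a single linear layer stands in for the MLP, and all feature dimensions and weights are illustrative rather than the paper's.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_quality(viewport_feats, g, W_att, mlp_w, mlp_b):
    """Attention-weighted aggregation: alpha_t comes from the interaction of
    each local feature f_t with the global feature g; the aggregate is
    concatenated with g and regressed to a scalar quality score."""
    scores = viewport_feats @ (W_att @ g)   # f_t interacting with g
    alpha = softmax(scores)                 # attention weights over the path
    agg = alpha @ viewport_feats            # weighted sum of local features
    z = np.concatenate([agg, g])            # [aggregated; global]
    return float(z @ mlp_w + mlp_b)         # linear layer stands in for the MLP

# Final prediction averages the score over K paths.
rng = np.random.default_rng(0)
T, d, K = 7, 16, 3
g = rng.standard_normal(d)
W_att = rng.standard_normal((d, d)) * 0.1
mlp_w = rng.standard_normal(2 * d) * 0.1
scores_k = [predict_quality(rng.standard_normal((T, d)), g, W_att, mlp_w, 0.0)
            for _ in range(K)]
final_score = sum(scores_k) / K
```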
Cross-Domain Augmentation
- Consistency loss: Predictions after weak augmentation should remain stable.
- Triplet loss: Score ordering constraint among clean / mildly distorted / heavily distorted images.
- Cross-ranking loss: Relative quality relationships between image pairs are preserved after augmentation.
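The three constraints above can be sketched as simple squared-error and hinge losses; the margin values are illustrative assumptions, not the paper's hyperparameters.

```python
def consistency_loss(q, q_aug):
    """Predictions should be stable under weak augmentation."""
    return (q - q_aug) ** 2

def triplet_loss(q_clean, q_mild, q_heavy, margin=0.5):
    """Hinge on the ordering q_clean > q_mild > q_heavy by a margin."""
    return (max(0.0, margin - (q_clean - q_mild)) +
            max(0.0, margin - (q_mild - q_heavy)))

def cross_ranking_loss(q_a, q_b, q_a_aug, q_b_aug, margin=0.1):
    """If image A outscored image B before augmentation, the augmented
    pair should preserve that ordering (hinge with a small margin)."""
    sign = 1.0 if q_a >= q_b else -1.0
    return max(0.0, margin - sign * (q_a_aug - q_b_aug))
```

All three losses vanish when the predicted scores already respect the expected ordering and stability, so they only act on violations.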
Key Experimental Results
Table 1: In-Dataset Evaluation (SRCC / PLCC)
| Method | JUFE | OIQA | CVIQD |
|---|---|---|---|
| NIQE (handcrafted) | 0.552 / 0.592 | 0.745 / 0.736 | 0.893 / 0.872 |
| MC360IQA | 0.502 / 0.623 | 0.875 / 0.906 | 0.877 / 0.892 |
| Assessor360 | 0.489 / 0.510 | 0.979 / 0.945 | 0.958 / 0.963 |
| GSR-X | 0.843 / 0.857 | 0.922 / 0.937 | 0.805 / 0.957 |
| Q-Insight (LLM) | 0.557 / 0.412 | 0.643 / 0.795 | 0.872 / 0.801 |
| RL-ScanIQA | 0.816 / 0.902 | 0.941 / 0.967 | 0.970 / 0.970 |
RL-ScanIQA achieves the highest PLCC on all datasets, and also the best SRCC on CVIQD. On JUFE, PLCC substantially outperforms the runner-up (0.902 vs. 0.857), demonstrating the advantage of the RL policy under real-world distortion distributions.
Table 2: Cross-Dataset Evaluation (SRCC / PLCC)
| Method | CVIQD → OIQA | CVIQD → JUFE | JUFE → CVIQD | JUFE → OIQA |
|---|---|---|---|---|
| Assessor360 | 0.853 / 0.632 | 0.887 / 0.749 | 0.617 / 0.724 | 0.405 / 0.499 |
| GSR-X | 0.804 / 0.765 | 0.831 / 0.694 | 0.782 / 0.732 | 0.733 / 0.611 |
| F-VQA(A) | 0.772 / 0.621 | 0.604 / 0.509 | 0.665 / 0.679 | 0.683 / 0.732 |
| RL-ScanIQA | 0.901 / 0.800 | 0.913 / 0.822 | 0.771 / 0.755 | 0.802 / 0.833 |
Cross-dataset generalization surpasses all baselines in nearly every setting (GSR-X is marginally higher only on SRCC for JUFE → CVIQD), validating the effectiveness of distortion augmentation and the ranking-consistency constraints.
Highlights & Insights
- First end-to-end RL-based 360° IQA framework: Jointly optimizes scanpath generation and quality assessment without requiring human eye-tracking data.
- Sophisticated multi-level reward design: From step-level to set-level to task-level, sparse IQA supervision is transformed into dense shaping signals.
- Counterintuitive yet valuable finding: Real human fixation trajectories perform worse than RL-learned paths (Table 3: SRCC 0.724 → 0.816), suggesting that humans fixate on salient rather than quality-critical regions.
- Strong cross-domain generalization: Distortion-space augmentation combined with ranking consistency losses enables effective transfer across diverse distortion types.
- Intuitive and convincing visualizations: Paths for high-quality images cover the sphere uniformly, while paths for low-quality images concentrate on distorted regions.
Limitations & Future Work
- High computational overhead: Inference requires \(K=15\) paths \(\times\) \(T=7\) steps \(= 105\) viewport feature extractions, limiting real-time applicability.
- Coarse viewport discretization: 32 candidate viewports may be insufficient to precisely localize fine-grained distortion regions.
- Limited benchmark coverage: Only three datasets are evaluated; panoramic IQA datasets are small in scale, with CVIQD and OIQA each containing only a few hundred images.
- Frozen DINOv2 feature extractor: A frozen pretrained model may not be the most distortion-sensitive feature extraction solution.
- Dependence on MOS annotations: Training still requires precise human subjective quality scores, incurring non-trivial annotation costs.
- Heavy reward hyperparameter burden: Four step-level weights, two diversity weights, two task-alignment weights, and five loss function weights yield a heavy tuning burden.
Related Work & Insights
- 2D BIQA: BRISQUE (natural scene statistics), DBCNN, TreS, MANIQA (Transformer), Q-Insight (multimodal RL + LLM).
- 360° BIQA: MC360IQA (multi-branch CNN with fixed viewports), VGCN (graph convolution over viewport relations), Assessor360 / GSR-X / F-VQA (scanpath modeling with decoupled training).
- RL for visual tasks: Viewpoint planning, video summarization, attention selection; PPO combined with variance reduction and value guidance demonstrates robustness under sparse rewards.
- 360° visual exploration: Eye-tracking studies reveal human viewing characteristics such as equatorial bias and salient object preference.
Rating
| Dimension | Score (1–10) | Notes |
|---|---|---|
| Novelty | 8 | First end-to-end RL integration into 360° IQA; the joint path + assessment optimization paradigm is novel. |
| Technical Contribution | 8 | Multi-level reward design is well-motivated; cross-domain augmentation strategy is effective. |
| Experimental Thoroughness | 7 | Three datasets and complete ablations, but dataset scale is limited. |
| Writing Quality | 8 | Clear structure, rich figures and tables, comprehensive comparisons. |
| Value | 7 | Demand for 360° IQA is growing, but inference overhead and hyperparameter count may hinder deployment. |
| Overall | 7.6 | An excellent work introducing active perception into 360° quality assessment; the end-to-end RL paradigm offers meaningful inspiration. |