Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing
Conference: CVPR 2025
arXiv: 2503.00548
Code: None
Area: Graph Learning / Scene Graph Generation
Keywords: Video Scene Graph, Debiasing, Memory-Guided, Iterative Relation Generation, Long-Tailed Distribution
TL;DR
This paper proposes VISA, a framework that debiases video scene graph generation on two fronts. On the visual side, Memory-Guided Sequence Modeling (MGSM) reduces feature variance. On the semantic side, an Iterative Relation Generator (IRG) introduces hierarchical context and reduces dependence on biased priors. Together they significantly improve tail-category performance on datasets such as Action Genome.
Background & Motivation
Background: Video Scene Graph Generation (VidSGG) aims to structure video content into
Limitations of Prior Work: Biases originate from two key overlooked dimensions: (1) Visual bias—objects in videos suffer from large feature variance due to occlusion, blur, and scale variations, leading to unstable visual representations generated by Transformers, which causes the model to favor matching high-frequency entities; (2) Semantic bias—predicting predicates solely based on visual features lacks sufficient context, causing the model to degenerate into relying on frequency priors in the training set.
Key Challenge: Visual instability + insufficient semantic context \(\rightarrow\) models fail to distinguish fine-grained relations (e.g., "holding" vs. "touching") and resort to safe predictions of high-frequency predicates.
Goal: To simultaneously perform debiasing at both the visual and semantic levels, enabling the model to make accurate predictions even on tail categories.
Key Insight: On the visual side, an exponential moving average (EMA) memory smooths object features; the paper proves that this reduces the variance from \(\Sigma\) to \(\frac{\lambda\Sigma}{2-\lambda} \approx \frac{\lambda\Sigma}{2}\). On the semantic side, an iterative generator progressively supplies contextual information, which increases the KL divergence between the predictive posterior and the biased prior.
Core Idea: Memory-smoothed visual features + iteratively supplemented semantic context = dual visual and semantic debiasing.
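The variance figure quoted in the Key Insight can be reproduced with a short derivation. This is a sketch under assumed conditions, not the paper's own proof: the per-frame features \(v_i^t\) are taken to be independent across frames with a common covariance \(\Sigma\), and the memory is assumed to have reached a steady state with variance \(V\).

\[
M_i^{t+1} = (1-\lambda)\,M_i^t + \lambda\,v_i^t
\;\Longrightarrow\;
\operatorname{Var}[M_i^{t+1}] = (1-\lambda)^2\,\operatorname{Var}[M_i^t] + \lambda^2\,\Sigma
\]

\[
V = (1-\lambda)^2 V + \lambda^2 \Sigma
\;\Longrightarrow\;
V = \frac{\lambda^2\,\Sigma}{1-(1-\lambda)^2} = \frac{\lambda^2\,\Sigma}{2\lambda-\lambda^2} = \frac{\lambda\,\Sigma}{2-\lambda} \approx \frac{\lambda\,\Sigma}{2} \quad (\lambda \ll 1)
\]

The approximation drops the \(-\lambda\) in the denominator, which is harmless for the small \(\lambda\) values the paper uses.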
Method
Overall Architecture
The input video frames pass through an object detector (Faster R-CNN + ResNet-101) to extract object region features, which are then fed into the two core modules of the VISA framework: MGSM to stabilize visual features, and IRG to iteratively generate relation predicates. The final output is a set of scene graph triplets for each frame, comprising three types of predicates: attention, spatial, and contact relations.
Key Designs
- Memory-Guided Sequence Modeling (MGSM):
    - Function: Stabilizes the visual feature representations of objects in videos, reducing the feature variance caused by occlusion, blur, and similar factors.
    - Mechanism: Maintains an exponential moving average (EMA) memory \(M_i^{t+1} = (1-\lambda)M_i^t + \lambda v_i^t\) for each object. The paper proves that the variance of the memory is \(\text{Var}[M_i^t] = \frac{\lambda\Sigma}{2-\lambda} \approx \frac{\lambda\Sigma}{2}\), significantly lower than the original variance \(\Sigma\). An adaptive weight \(W_i^t = \sigma(\text{MLP}(v_i^t))\) then fuses current and previous frame features. Finally, a dual-attention mechanism enhances the features, using the memory as the Key and the current features as the Value.
    - Design Motivation: Traditional Transformers focus only on inter-frame self-attention and ignore the temporal smoothness of object features. The EMA memory provides a stable feature anchor at almost zero additional cost, with \(\lambda\) controlling the smoothing intensity (0.04 for SGCLS, 0.06 for SGDET).
- Iterative Relation Generator (IRG):
    - Function: Reduces the model's reliance on biased priors by iteratively supplying semantic context.
    - Mechanism: Follows an information-theoretic argument: conditioning on additional context \(S\) cannot increase the conditional entropy, \(H(r_{ij}|v_i,v_j,S) \leq H(r_{ij}|v_i,v_j)\), which the paper equates with increasing the KL divergence between the predictive posterior and the biased prior. The first iteration predicts a preliminary scene graph from basic features (visual + spatial + GloVe semantic embeddings). Subsequent iterations pass the predicted triplet embeddings through a Hierarchical Semantic Extractor (HSE) and feed them back to the relation generator as additional context, progressively refining the predictions.
    - Design Motivation: In a single-pass prediction, insufficient context forces the model to fall back on priors. Iterative generation lets the model use its existing scene graph predictions as extra cues (e.g., "A is predicted as walking" helps evaluate "whether A is looking at something").
- Hierarchical Semantic Extractor (HSE):
    - Function: Extracts multi-scale semantic information from the predicted triplets for the next iteration.
    - Mechanism: Decomposes compound features into fine-grained subject/object representations, downsamples them with a stride-2 convolution, and concatenates the results to capture multi-level context.
    - Design Motivation: Simple concatenation fails to fuse visual and semantic information effectively; in the ablation studies the hierarchical structure contributes a gain of 1.4-2.4 points in mR@50.
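The EMA step of MGSM described above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it covers only the memory update (not the adaptive fusion or the attention layer), and the function name `ema_memory`, the toy feature sizes, the identity noise covariance, and the choice to initialise the memory with the first frame are all assumptions made for illustration.

```python
import numpy as np


def ema_memory(features, lam):
    """Apply M^{t+1} = (1 - lam) * M^t + lam * v^t along a (T, D) sequence."""
    memory = np.empty_like(features)
    state = features[0].copy()  # assumption: memory starts at the first observation
    for t, v in enumerate(features):
        state = (1.0 - lam) * state + lam * v
        memory[t] = state
    return memory


rng = np.random.default_rng(0)
lam = 0.04                      # SGCLS value quoted above
num_frames, dim = 20000, 8      # toy sizes, not taken from the paper
clean = rng.normal(size=dim)    # hypothetical "true" object feature
noisy = clean + rng.normal(size=(num_frames, dim))  # per-frame noise with Sigma = I

memory = ema_memory(noisy, lam)
burn_in = 500                   # discard the transient before steady state
empirical_ratio = memory[burn_in:].var(axis=0).mean() / noisy.var(axis=0).mean()
predicted_ratio = lam / (2.0 - lam)
print(f"empirical {empirical_ratio:.4f} vs predicted {predicted_ratio:.4f}")
```

With independent per-frame noise, the empirical ratio should land close to \(\frac{\lambda}{2-\lambda}\); real video features are temporally correlated, so the reduction observed in practice may differ.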
Loss & Training
The total loss is \(L_{\text{total}} = L_p + L_e + L_{\text{contra}}\), where \(L_p\) and \(L_e\) are the cross-entropy losses for predicates and entities respectively, and \(L_{\text{contra}}\) is the contrastive loss (following TEMPURA). An AdamW optimizer is used with a learning rate of 1e-5, and the model is trained on a single RTX 4090 for 15 epochs.
Key Experimental Results
Main Results
Comparison of mR@K on the Action Genome dataset (With Constraint):
| Task | Metric | TEMPURA | FloCoDe | VISA | Gain vs. FloCoDe |
|---|---|---|---|---|---|
| PREDCLS | mR@10 | 42.9 | 44.8 | 46.9 | +2.1 |
| SGCLS | mR@10 | 34.0 | 37.4 | 40.8 | +3.4 |
| SGDET | mR@10 | 22.6 | 24.2 | 27.3 | +3.1 |
Under the Semi Constraint setting (closer to practical application), the improvements are larger:
| Task | Metric | TEMPURA | VISA | Gain vs. TEMPURA |
|---|---|---|---|---|
| PREDCLS | mR@20 | 44.5 | 56.3 | +11.8 |
| SGCLS | mR@20 | 39.5 | 52.6 | +13.1 |
| SGDET | mR@20 | 21.8 | 31.7 | +9.9 |
There are also 7-8% gains on the PVSG and 4DPVSG datasets.
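The tables above report mean recall (mR@K) rather than plain recall because mR@K averages recall per predicate class, so rare predicates count as much as frequent ones. The sketch below illustrates that difference with made-up counts; the helper name `recall_and_mean_recall` and all numbers are hypothetical, and the per-frame top-K selection and the constraint settings of the actual evaluation protocol are deliberately omitted.

```python
def recall_and_mean_recall(hits, totals):
    """Return (overall recall, mean per-class recall).

    totals[c]: number of ground-truth relations with predicate class c.
    hits[c]:   how many of those were recovered among the top-K predictions.
    """
    overall = sum(hits.get(c, 0) for c in totals) / sum(totals.values())
    per_class = [hits.get(c, 0) / n for c, n in totals.items() if n > 0]
    mean_recall = sum(per_class) / len(per_class)
    return overall, mean_recall


# Toy long-tailed distribution: one head predicate, two tail predicates.
totals = {"looking_at": 900, "holding": 90, "wiping": 10}
hits = {"looking_at": 810, "holding": 45, "wiping": 1}

overall, mean_recall = recall_and_mean_recall(hits, totals)
print(f"R = {overall:.3f}, mR = {mean_recall:.3f}")
```

In this toy case the overall recall is dominated by the head class, while the mean recall is pulled down by the two tail classes. That is why a model that leans on frequency priors can score well on R@K yet poorly on mR@K.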
Ablation Study
| Configuration | SGCLS mR@10 (Semi) | SGDET mR@10 (No Constr.) | Description |
|---|---|---|---|
| Full VISA | 47.8 | 30.7 | Full model |
| w/o MGSM | 45.6 | 27.9 | Visual debiasing contributes 2-3 points |
| w/o IRG | 34.0 | - | Semantic debiasing contributes 13.8 points |
| w/o HSE | - | - | mR@50 drops by 1.4-2.4 points; the hierarchical structure is effective |
Key Findings
- Semantic debiasing (IRG) contributes far more than visual debiasing (MGSM): Performance drops by over 13 points when IRG is removed, compared to 2-3 points when MGSM is removed, indicating that insufficient semantic context is the primary source of bias.
- Diminishing returns for the number of iterations N: Increasing \(N\) from 1 to 4 yields only about 0.8% improvement, but training time doubles for \(N \geq 2\). In practice, \(N=1\) is sufficient.
- Tail categories show the most significant gains: Under SGDET No Constraint, performance on tail categories increases by 11.0%, verifying the effectiveness of debiasing.
- \(\lambda\) has different optimal values across tasks: SGCLS (with ground-truth boxes) uses a smaller 0.04, while SGDET (detection from scratch) uses a larger 0.06.
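To give a feel for what the two \(\lambda\) settings imply, the snippet below plugs them into the variance ratio \(\frac{\lambda}{2-\lambda}\) quoted in the method description. The "equivalent frames" figure is an interpretation added here, not a quantity from the paper: it is the number of independent frames whose plain average would have the same variance, assuming independent per-frame noise.

```python
def ema_variance_ratio(lam):
    """Memory variance relative to per-frame variance for EMA weight lam."""
    return lam / (2.0 - lam)


settings = {"SGCLS": 0.04, "SGDET": 0.06}  # lambda values reported in the paper
for task, lam in settings.items():
    ratio = ema_variance_ratio(lam)
    equivalent_frames = 1.0 / ratio  # independent frames giving the same variance
    print(f"{task}: lambda={lam:.2f}, ratio={ratio:.4f}, frames={equivalent_frames:.1f}")

# SGCLS: lambda=0.04, ratio=0.0204, frames=49.0
# SGDET: lambda=0.06, ratio=0.0309, frames=32.3
```

Under these assumptions the smaller SGCLS value averages over a longer effective history, while the larger SGDET value lets the memory track the current frame more closely.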
Highlights & Insights
- Combining theory and practice: Derives the variance reduction ratio of EMA using a Gaussian noise model, and the debiasing formulation of the iterative generator via information theory. The designs are guided by theory rather than post-hoc explanation.
- Orthogonality of dual debiasing: Visual and semantic debiasing address problems at different levels—the former improves feature quality, while the latter enhances the reasoning process. They contribute independently and are mutually complementary.
- Lightweight nature of EMA memory: Requires only a moving average buffer and one attention layer, significantly reducing visual feature variance with almost zero extra computational cost.
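The information-theoretic argument behind IRG rests on a standard identity: the entropy removed by conditioning on context equals the expected KL divergence between the context-aware posterior and the context-free distribution. The toy check below verifies this on a hand-made joint distribution. It is an illustration only: the visual inputs are dropped for brevity, the 2x2 table is invented, and whether the paper's derivation takes exactly this form is an assumption.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


# joint[r, s]: probability of predicate r together with context value s (toy numbers)
joint = np.array([[0.50, 0.10],
                  [0.05, 0.35]])
p_r = joint.sum(axis=1)   # context-free distribution over predicates (the "prior")
p_s = joint.sum(axis=0)   # distribution over context values

h_prior = entropy(p_r)    # H(r)
h_posterior = 0.0         # H(r | S)
expected_kl = 0.0         # E_S[ KL( p(r | S) || p(r) ) ]
for s in range(joint.shape[1]):
    post = joint[:, s] / p_s[s]
    h_posterior += p_s[s] * entropy(post)
    expected_kl += p_s[s] * float((post * np.log(post / p_r)).sum())

print(f"H(r) = {h_prior:.4f}, H(r|S) = {h_posterior:.4f}, E[KL] = {expected_kl:.4f}")
```

The gap `h_prior - h_posterior` matches `expected_kl` up to floating-point error, so any context that lowers the conditional entropy necessarily moves the posterior away from the prior on average.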
Limitations & Future Work
- Object detector remains the bottleneck: When the detector misses small objects (e.g., a cup), the triplets involving them are never produced, which limits overall performance.
- Limited iterative self-correction capability: Improvements quickly saturate as \(N\) increases, and self-generated semantic context cannot exceed the upper bound of the model's inherent capability.
- Dataset annotation noise: Action Genome contains incorrect and ambiguous annotations (e.g., "looking at a cup" vs. "near a cup"), affecting the fairness of evaluation.
- Main experiments validated on a single dataset: Although PVSG/4DPVSG are supplemented, Action Genome remains the only benchmark for comprehensive evaluation.
Related Work & Insights
- vs TEMPURA: TEMPURA uses contrastive learning to assist debiasing but does not address visual noise. VISA builds on it, adding MGSM for visual stabilization and IRG for semantic iteration, and gains roughly 10-13 points in mR@20 under Semi Constraint.
- vs FloCoDe: FloCoDe focuses on flow information to assist predicate prediction, but still suffers from visual instability. VISA's EMA memory fundamentally improves feature stability.
- vs Image SGG methods: The unique challenge of video SGG lies in temporal instability, and MGSM is specifically designed to tackle this problem.
Rating
- Novelty: ⭐⭐⭐⭐ The dual visual and semantic debiasing framework is novel, supported by solid theoretical derivations.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three constraint settings, multiple datasets, and detailed ablations are provided, although comprehensive evaluation relies on a single benchmark.
- Writing Quality: ⭐⭐⭐⭐ Theoretical derivations are clear, though notations are dense and some formulations could be further simplified.
- Value: ⭐⭐⭐⭐ Provides a novel debiasing paradigm for VidSGG, with significant gains on tail categories.