Hierarchical Dual-Change Collaborative Learning for UAV Scene Change Captioning
Conference: CVPR2025
arXiv: 2603.12832
Code: None
Area: Remote Sensing
Keywords: UAV, Change Captioning, View Change, Transformer, Cross-Modal Alignment
TL;DR
This paper proposes a new task named UAV Scene Change Captioning (UAV-SCC) and a novel HDC-CL framework. It models the overlapping and non-overlapping regions of image pairs under moving viewpoints using a Dynamic Adaptive Layout Transformer, enhances viewpoint shift direction awareness via hierarchical cross-modal directional consistency calibration, and constructs a dedicated benchmark dataset.
Background & Motivation
Background
Traditional change captioning assumes a fixed viewpoint with pixel-level aligned image pairs and describes only semantic changes over time.
Limitations of Prior Work
In UAV scenarios the camera moves, so viewpoint shift makes the spatial layout of the image pair inconsistent and leaves only partial scene overlap.
Key Challenge
Two challenges follow: (1) modeling the relationship between overlapping and non-overlapping regions to handle parallax; (2) capturing the direction cues introduced by viewpoint motion so that scene changes are interpreted correctly.
Core Idea
Existing methods target aligned scenes and cannot cope with the partial overlap and layout inconsistency caused by a moving UAV viewpoint, so HDC-CL models overlapping and non-overlapping regions explicitly and calibrates its captions against the direction of the viewpoint shift.
Method
Three Stages of the HDC-CL Framework
1. Image Alignment
- Shift Voting Mechanism: Estimates the overlap mask of the image pair.
  - Computes the pairwise feature similarity between patches of the two images to find the best match and relative displacement \(\Delta\) for each patch.
  - Tallies votes for each candidate \(\Delta\) and selects the \(\Delta^*\) with the highest cumulative similarity as the dominant displacement.
  - Generates a binary common mask from \(\Delta^*\) to separate overlapping from non-overlapping regions.
- Dynamic Adaptive Layout Transformer (DALT):
  - Decomposes the features of each image into three types of regions: global (glo), common (com), and difference (diff).
  - Assigns a learnable [CLS] token to each region category.
  - Jointly models the regions in a unified multi-head self-attention encoder to obtain region-aware features.
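The shift voting steps can be illustrated with a minimal NumPy sketch. It assumes patch features on an H x W grid, cosine similarity as the matching score, and a single integer displacement per image pair; the function name `shift_voting` and these choices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def shift_voting(feat_a, feat_b):
    """Estimate a dominant patch displacement and a binary common mask.

    feat_a, feat_b: arrays of shape (H, W, C) holding patch features.
    Returns (delta_star, mask_a); mask_a[i, j] is True when patch (i, j) of
    image A lands inside image B after shifting by delta_star.
    Hypothetical helper: cosine similarity and a hard argmax are assumptions.
    """
    H, W, C = feat_a.shape
    a = feat_a.reshape(-1, C)
    b = feat_b.reshape(-1, C)
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    sim = a @ b.T                        # pairwise cosine similarity, (H*W, H*W)

    best = sim.argmax(axis=1)            # best match in B for each patch of A
    best_sim = sim.max(axis=1)
    rows_a, cols_a = np.divmod(np.arange(H * W), W)
    rows_b, cols_b = np.divmod(best, W)
    dy, dx = rows_b - rows_a, cols_b - cols_a

    # Each patch votes for its displacement, weighted by its match similarity.
    votes = np.zeros((2 * H - 1, 2 * W - 1))
    np.add.at(votes, (dy + H - 1, dx + W - 1), best_sim)
    iy, ix = np.unravel_index(votes.argmax(), votes.shape)
    delta_star = (int(iy) - (H - 1), int(ix) - (W - 1))

    # Common mask: patches of A whose shifted position still lies inside B.
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    ti, tj = ii + delta_star[0], jj + delta_star[1]
    mask_a = (ti >= 0) & (ti < H) & (tj >= 0) & (tj < W)
    return delta_star, mask_a

# Toy pair: two 4x4 crops of one synthetic scene, offset by 1 row and 2 columns.
rng = np.random.default_rng(0)
scene = rng.normal(size=(6, 8, 32))
img_a = scene[0:4, 0:4]
img_b = scene[1:5, 2:6]
delta_star, mask_a = shift_voting(img_a, img_b)
print(delta_star, int(mask_a.sum()))
```

In the toy pair, the crops share a 3 x 2 block of patches, so the recovered displacement maps that block of image A onto image B and the mask marks exactly those patches as common.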
2. Scene Change Distillation
- Context Feature Decoupling: Uses independent encoders (GE, CE, DE) to extract [CLS]-level semantics for global, common, and difference regions.
- Hierarchical Consistency Constraints:
  - Global consistency (InfoNCE): Aligns the background semantics of the image pair.
  - Region consistency (InfoNCE): Aligns the invariant semantics in overlapping regions.
  - Independence regularization (HSIC): Minimizes statistical dependence between the difference features of the "before" and "after" images to encourage capturing diverse change information.
- Scene Change Distillation: Cross-attention models cross-image correspondence in common regions, and a residual mechanism extracts local differences, which are fused with global differences to obtain a unified change representation D.
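The two kinds of constraint named above can be sketched in NumPy as follows. This is a hedged illustration only: the temperature, the Gaussian kernel and its bandwidth `sigma`, and the biased HSIC estimator are assumptions, since the summary does not state which variants the paper uses.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """One-directional InfoNCE: row i of z_b is the positive for row i of z_a."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC with Gaussian kernels (assumed kernel choice).

    Larger values indicate stronger statistical dependence between x and y.
    """
    n = x.shape[0]

    def gram(v):
        sq = np.sum((v[:, None, :] - v[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

    h = np.eye(n) - np.ones((n, n)) / n                    # centering matrix
    return float(np.trace(gram(x) @ h @ gram(y) @ h) / (n - 1) ** 2)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce(z, z)                                   # positives match
shuffled = info_nce(z, np.roll(z, 1, axis=0))              # positives misplaced

x = rng.normal(size=(64, 2))
y = rng.normal(size=(64, 2))
dependent = hsic(x, x)                                     # fully dependent pair
independent = hsic(x, y)                                   # independent pair
print(aligned < shuffled, dependent > independent)
```

Minimizing `info_nce` pulls matched global or common features together, while minimizing `hsic` between the two difference features pushes them toward independence, which is the role the list above assigns to each term.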
3. Caption Generation
- A Transformer decoder generates directional descriptions based on the change representation D.
- HCM-OCC (Hierarchical Cross-modal Directional Consistency Calibration):
  - Computes the visual direction vector \(\Delta d = D_{\text{forward}} - D_{\text{reverse}}\).
  - Computes the textual direction vector \(\Delta t = T_{\text{forward}} - T_{\text{reverse}}\).
  - A bidirectional margin ranking loss aligns the visual and textual directional semantics.
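A bidirectional margin ranking loss over direction vectors might look like the sketch below. It assumes a batch-wise hinge formulation with cosine similarity, in which every other sample in the batch serves as a negative; the margin value and the function name `bidirectional_margin_ranking` are hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np

def bidirectional_margin_ranking(delta_d, delta_t, margin=0.2):
    """Hinge ranking loss between visual (delta_d) and textual (delta_t) directions.

    Row i of delta_d and row i of delta_t form the positive pair; all other
    rows in the batch act as negatives (an assumed negative-sampling scheme).
    """
    d = delta_d / np.linalg.norm(delta_d, axis=1, keepdims=True)
    t = delta_t / np.linalg.norm(delta_t, axis=1, keepdims=True)
    sim = d @ t.T                                  # sim[i, j] = cos(delta_d_i, delta_t_j)
    pos = np.diag(sim)
    # Visual -> text: the positive of row i must beat every other text in row i.
    cost_v2t = np.maximum(0.0, margin - pos[:, None] + sim)
    # Text -> visual: the positive of column j must beat every other visual in column j.
    cost_t2v = np.maximum(0.0, margin - pos[None, :] + sim)
    off_diag = ~np.eye(len(sim), dtype=bool)
    return float((cost_v2t[off_diag].sum() + cost_t2v[off_diag].sum()) / len(sim))

# Toy features: forward and reverse representations point in opposite directions.
d_forward, d_reverse = np.eye(4), -np.eye(4)
t_forward, t_reverse = np.eye(4), -np.eye(4)

delta_d = d_forward - d_reverse
loss_aligned = bidirectional_margin_ranking(delta_d, t_forward - t_reverse)
loss_flipped = bidirectional_margin_ranking(delta_d, t_reverse - t_forward)
print(loss_aligned, loss_flipped)
```

When the textual direction agrees with the visual one the loss vanishes, and swapping the forward and reverse captions produces a large penalty, which is the behavior a direction-aware calibration needs.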
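Loss & Training
\(\mathcal{L} = \mathcal{L}_{cap} + \lambda(\mathcal{L}_{con} + \mathcal{L}_{align})\)
Here \(\mathcal{L}_{cap}\) is the captioning loss and \(\lambda\) is a weighting coefficient. Judging from the stage descriptions above, \(\mathcal{L}_{con}\) presumably collects the hierarchical consistency constraints and \(\mathcal{L}_{align}\) the directional alignment loss.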
Key Experimental Results
UAV-SCC Dataset
- UAV-SCCSimple: 9,017 image pairs, average caption length ~27 words, 3 captions/pair
- UAV-SCCRich: 7,054 image pairs, average caption length ~14 words, 5 captions/pair
Main Results (Comparison with 6 Baselines)
| Method | UAV-SCCSimple (B/M/R/C/S) | UAV-SCCRich (B/M/R/C/S) |
|---|---|---|
| CARD | 27.49/26.23/42.98/48.66/30.76 | 18.66/16.46/45.03/15.75/11.87 |
| HDC-CL | 31.13/27.34/44.58/54.68/33.09 | 19.26/18.45/44.32/19.16/13.00 |
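- CIDEr improves by 6.02 on Simple (48.66 \(\rightarrow\) 54.68) and by 3.41 on Rich (15.75 \(\rightarrow\) 19.16) over CARD, the strongest baseline.
- BLEU-4 improves by 3.64 on Simple (27.49 \(\rightarrow\) 31.13).
- On Rich, HDC-CL trails CARD on ROUGE (44.32 vs. 45.03) and leads on the other four metrics.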
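The gains can be recomputed directly from the two table rows. The short script below copies the scores from the table; the metric keys follow the table's B/M/R/C/S abbreviations, and the dictionary layout is just one convenient way to hold them.

```python
METRICS = ["B", "M", "R", "C", "S"]   # column order used in the table

SCORES = {
    "CARD": {
        "Simple": [27.49, 26.23, 42.98, 48.66, 30.76],
        "Rich": [18.66, 16.46, 45.03, 15.75, 11.87],
    },
    "HDC-CL": {
        "Simple": [31.13, 27.34, 44.58, 54.68, 33.09],
        "Rich": [19.26, 18.45, 44.32, 19.16, 13.00],
    },
}

def gains(split):
    """Per-metric difference HDC-CL minus CARD on one dataset split."""
    ours, base = SCORES["HDC-CL"][split], SCORES["CARD"][split]
    return {m: round(o - b, 2) for m, o, b in zip(METRICS, ours, base)}

for split in ("Simple", "Rich"):
    print(split, gains(split))
```

Running it reproduces the CIDEr and BLEU-4 differences quoted above and also shows the per-metric picture on both splits, including the one metric on Rich where the difference is negative.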
Ablation Study
- The joint utilization of the three losses (global, region, and HSIC) achieves the best performance.
- The effectiveness of the shift voting mechanism in DALT is validated through ablation.
- The HCM-OCC directional consistency calibration brings stable improvements.
- Using HSIC regularization alone gains 4.38 CIDEr (13.56 \(\rightarrow\) 17.94) on Rich.
- Combining the three losses achieves a peak CIDEr of 19.16 on Rich, validating the complementarity of hierarchical constraints.
Highlights & Insights
- Practical value of the new task definition: UAV-SCC fills the gap in change captioning under moving viewpoints, closer to real UAV applications than fixed-viewpoint change captioning.
- Elegant Shift Voting mechanism: Adaptively estimates overlapping regions without extra annotations to handle parallax.
- Unique direction-aware design: HCM-OCC aligns forward/reverse directional gaps with textual direction, enabling the model to perceive viewpoint shift directions.
- Contribution of comprehensive new benchmarks: Two dataset versions (Simple/Rich) are built to support evaluation at different granularities.
Limitations & Future Work
- The dataset scale is relatively limited (9,017 and 7,054 image pairs in the two subsets), which may be insufficient to train large-scale models.
- Features are extracted using only ResNet-101, with no experiments conducted using stronger visual backbones or pre-trained VLMs.
- Shift Voting assumes a single global shift, offering limited adaptability to complex rotation or scale transformations.
- Captioning evaluation metrics (BLEU, METEOR, CIDEr) may not fully reflect the accuracy of directional descriptions.
- Lack of comparison with large multimodal models (e.g., GPT-4V).
- The dataset is constructed based on images from existing public datasets, which may limit scene diversity.
- The forward and reverse directions of descriptions only consider two reciprocal directions, without modeling more complex rotational directional relationships.
Rating
- Novelty: ⭐⭐⭐⭐ (new task + full methodology + new dataset)
- Experimental Thoroughness: ⭐⭐⭐⭐ (comprehensive comparisons, detailed ablations)
- Writing Quality: ⭐⭐⭐⭐ (clear structure, intuitive figures)
- Value: ⭐⭐⭐⭐ (new direction for UAV scene understanding, significant contribution to benchmark)